Re: Tika - handling custom encoding PDF files

Tilman Hausherr Mon, 21 Jun 2021 19:28:42 -0700

Hi,
I don't see anything.
Not all PDFs can be extracted. Re "magically process", no:
https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0
Tilman


On 21.06.2021 at 21:18, Nicholas DiPiazza wrote:

Let's say we have a PDF with a bunch of custom encodings. They would look like this in your Font Properties:
image.png

Notice those with "encoding: custom".

So even though the PDF has normal-looking Hebrew text such as:

image.png
When you copy it to clipboard it looks like this:

©°³ ž ³ž©¤³
That's because the custom encoding does not actually map to UTF-8 characters.
Has anyone heard of a way to magically process these custom encodings to find a reasonable UTF-8 mapping?
I'm not even sure how that would be possible, but I figured I'd just reach out and see how y'all out there in the wild have handled custom encodings.
In particular, I want to index my PDFs into Solr, but doing so is completely useless because the custom encodings index as complete gibberish.
Any ideas?

-Nicholas
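
Since the thread concludes that such text cannot be recovered automatically, one option is to at least detect the gibberish before it reaches the index. The sketch below is only a heuristic, not a Tika or PDFBox feature: it assumes the documents are expected to be mostly Hebrew and measures how many extracted characters fall in the Unicode Hebrew block (U+0590 to U+05FF). The function names `script_ratio` and `looks_garbled` and the 0.5 threshold are hypothetical choices made for illustration.

```python
def script_ratio(text: str, start: int = 0x0590, end: int = 0x05FF) -> float:
    """Return the fraction of non-whitespace characters whose code point
    lies in [start, end]. The default range is the Unicode Hebrew block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        # No visible characters at all: report 0.0 rather than divide by zero.
        return 0.0
    in_range = sum(1 for c in chars if start <= ord(c) <= end)
    return in_range / len(chars)


def looks_garbled(text: str, threshold: float = 0.5) -> bool:
    """Flag text in which fewer than `threshold` of the visible characters
    belong to the expected script. The threshold is an assumed value and
    would need tuning on real documents. Empty text is flagged as well."""
    return script_ratio(text) < threshold


# The string copied from the PDF in the message above.
copied = "©°³ ž ³ž©¤³"
# An ordinary Hebrew phrase for comparison (not taken from the PDF).
real_hebrew = "שלום עולם"

print(looks_garbled(copied))       # no Hebrew code points at all
print(looks_garbled(real_hebrew))  # every visible character is Hebrew
```

Text that is flagged this way could be skipped or handled separately instead of being sent to Solr. The check does not repair anything; it only separates files whose extracted text is unusable from those that extracted cleanly.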
