Tika - handling custom encoding PDF files

Nicholas DiPiazza Mon, 21 Jun 2021 12:18:34 -0700

Let's say we have a PDF with a bunch of custom encodings. They would look
like this in your Font Properties:


[image: image.png]

Notice those with "encoding: custom".

So even though the PDF has normal-looking Hebrew text such as:

[image: image.png]

when you copy it to the clipboard it looks like this:

©°³ ž ³ž©¤³

That's because the fonts' custom encoding does not actually map to Unicode
characters.

Has anyone heard of a way to magically process these custom encodings to
find a reasonable UTF-8 mapping?

I'm not even sure how that would be possible, but I figured I'd just reach
out and see how y'all out there in the wild have handled custom encodings.

In particular, I want to index my PDFs into Solr, but doing so is useless
because the text from the custom encodings indexes as complete gibberish.
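
This does not recover the text, but one way to at least keep the gibberish
out of the index might be to check the extracted text before sending it to
Solr. Below is a minimal sketch in Python, assuming the documents are
expected to be mostly Hebrew. The names hebrew_ratio and looks_garbled and
the 0.5 threshold are made up for illustration and are not part of Tika or
Solr.

```python
def hebrew_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that fall in the
    Hebrew Unicode blocks (U+0590-U+05FF and U+FB1D-U+FB4F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hebrew = sum(
        1
        for c in chars
        if "\u0590" <= c <= "\u05ff" or "\ufb1d" <= c <= "\ufb4f"
    )
    return hebrew / len(chars)


def looks_garbled(text: str, threshold: float = 0.5) -> bool:
    """Heuristic: text that should be Hebrew but contains mostly
    non-Hebrew characters is probably a custom-encoding artifact.
    The threshold is an assumed value and would need tuning."""
    if not text.strip():
        return False  # nothing extracted, so nothing to judge
    return hebrew_ratio(text) < threshold


if __name__ == "__main__":
    print(looks_garbled("©°³ ž ³ž©¤³"))  # the clipboard text from above
    print(looks_garbled("שלום עולם"))    # ordinary Hebrew text
```

Documents flagged this way could be set aside instead of indexed as-is, for
example to be run through OCR later.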

Any ideas?

-Nicholas
