Hi,
I don't see anything.
Not all PDFs can be extracted. Re "magically process", no:
https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0
Tilman

On 21.06.2021 at 21:18, Nicholas DiPiazza wrote:
Let's say we have a PDF with a bunch of custom encodings. They would look like this in your Font Properties:

image.png

Notice those with "encoding: custom".

So even though the PDF has normal-looking Hebrew text such as:

image.png
when you copy it to the clipboard it looks like this:

©°³ ž ³ž©¤³

That's because the custom encoding does not actually map to Unicode characters.

Has anyone heard of a way to magically process these custom encodings to find a reasonable Unicode mapping?

I'm not even sure how that would be possible, but I figured I'd just reach out and see how y'all out there in the wild have handled custom encodings.

In particular, I want to index my PDFs into Solr, but doing so is completely useless because the custom encodings index as complete gibberish.
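
Short of recovering the mapping, one stopgap might be to detect the gibberish before it reaches Solr. Below is a minimal sketch in Python, assuming the documents are expected to be mostly Hebrew. The function names and the 0.5 threshold are hypothetical and are not part of PDFBox or Solr. It only flags suspicious text; it does not repair it.

```python
def hebrew_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the Hebrew block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # U+0590..U+05FF is the Unicode Hebrew block.
    hebrew = sum(1 for c in chars if "\u0590" <= c <= "\u05FF")
    return hebrew / len(chars)


def looks_extractable(text: str, threshold: float = 0.5) -> bool:
    """Heuristic: treat text as usable if enough of it is Hebrew.

    The threshold is an assumed value and would need tuning on real data.
    """
    return hebrew_ratio(text) >= threshold


if __name__ == "__main__":
    print(looks_extractable("שלום עולם"))
    print(looks_extractable("©°³ ž ³ž©¤³"))
```

Documents that fail the check could be skipped or routed to OCR instead of being indexed as-is.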

Any ideas?

-Nicholas

