Hi,
I don't see any images.
Text cannot be extracted from every PDF. As for a way to "magically process"
these encodings, there is none; see:
https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0
Tilman
On 21.06.2021 at 21:18, Nicholas DiPiazza wrote:
Let's say we have a PDF with a bunch of custom encodings. They would
look like this in your Font Properties:
[attached image: image.png]
Notice those with "encoding: custom".
So even though the PDF has normal-looking Hebrew text such as:
[attached image: image.png]
When you copy it to clipboard it looks like this:
©°³ ž ³ž©¤³
That's because the custom encoding does not actually map to Unicode
characters.
Has anyone heard of a way to magically process these custom encodings
to find a reasonable Unicode mapping?
I'm not even sure how that would be possible, but I figured I'd just
reach out and see how y'all out there in the wild have handled custom
encodings.
In particular, I want to index my PDFs into Solr, but doing so is
completely useless because the text from the custom encodings is
indexed as complete gibberish.
Any ideas?
-Nicholas
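
Regarding the Solr indexing problem above: since the mapping itself cannot be recovered, one option is to at least detect garbled extraction output before it is indexed, so the affected documents can be skipped or handled some other way. The following is only a minimal sketch under stated assumptions. It is not part of PDFBox or Solr; the class name ExtractedTextCheck, both method names, and the 0.5 threshold are hypothetical and chosen for illustration, and it assumes you already know which script (here Hebrew) the document should contain. It does not repair the text; it only flags output with too few letters of the expected script.

```java
import java.lang.Character.UnicodeScript;

public class ExtractedTextCheck {

    /**
     * Fraction of the letters in the text that belong to the expected script.
     * Returns 0.0 when the text contains no letters at all.
     */
    public static double scriptRatio(String text, UnicodeScript expected) {
        int letters = 0;
        int matching = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // symbols, digits, and whitespace are not counted
            }
            letters++;
            if (UnicodeScript.of(cp) == expected) {
                matching++;
            }
        }
        return letters == 0 ? 0.0 : (double) matching / letters;
    }

    /** Heuristic only: flags text whose share of expected-script letters is below minRatio. */
    public static boolean looksMisencoded(String text, UnicodeScript expected, double minRatio) {
        return scriptRatio(text, expected) < minRatio;
    }

    public static void main(String[] args) {
        // The clipboard sample from this thread, written with Unicode escapes.
        String garbled = "\u00A9\u00B0\u00B3 \u017E \u00B3\u017E\u00A9\u00A4\u00B3";
        // A short word made of real Hebrew letters, for comparison.
        String hebrew = "\u05E9\u05DC\u05D5\u05DD";

        System.out.println(looksMisencoded(garbled, UnicodeScript.HEBREW, 0.5)); // true
        System.out.println(looksMisencoded(hebrew, UnicodeScript.HEBREW, 0.5));  // false
    }
}
```

A check like this would run on the extracted string before the document is sent to the index. The threshold would need tuning for documents that legitimately mix scripts.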