Marc Reichman created PDFBOX-6054:
-------------------------------------
Summary: Enable API support to check when text is scrambled and/or
if some of the unicode mapping warnings happen
Key: PDFBOX-6054
URL: https://issues.apache.org/jira/browse/PDFBOX-6054
Project: PDFBox
Issue Type: Improvement
Components: Text extraction
Affects Versions: 3.0.5 PDFBox
Environment: Linux / JDK 21 / Docker
Windows / JDK 21
Reporter: Marc Reichman
Attachments: 7E32D4EAD8382000E24D9967C1913F6E.pdf
With the attached PDF, text extraction produces a lot of gibberish. I have seen
other issues mention this, but this particular file displays perfectly fine in
Edge and Chrome. I have opened it in the PDF debugger, but it is hard to tell
what I'm looking at.
The pdftotext tool from xpdf produces the same gibberish. Interestingly, the
pdffonts tool does not flag any of the fonts as a problem.
I understand that this happens and that it is caused by PDF generation bugs,
such as not including proper Unicode mappings, etc. Still, I am curious: could
we check a property or get a specific exception when a Unicode mapping is not
available? I'm not sure whether that would be overcorrecting, i.e. whether
Unicode mapping failures are simply a normal fact of life.
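
Independent of any PDFBox support, one rough way to flag suspect output is to
inspect the extracted string itself. The following is a minimal sketch, not a
PDFBox API: the class name ExtractionQuality, its methods, and the threshold
are all hypothetical. It only counts code points that rarely occur in correctly
mapped text (U+FFFD, private-use, unassigned, lone surrogates, non-whitespace
control characters), so it would not catch glyphs that map to valid but wrong
letters.

```java
// Hypothetical helper, not part of PDFBox: a heuristic check on text that has
// already been extracted (for example, the String returned by a text stripper).
final class ExtractionQuality {

    private ExtractionQuality() {
    }

    /** Returns true for code points that rarely appear in correctly mapped text. */
    static boolean isSuspicious(int codePoint) {
        if (codePoint == 0xFFFD) {
            return true; // Unicode replacement character
        }
        int type = Character.getType(codePoint);
        if (type == Character.PRIVATE_USE
                || type == Character.UNASSIGNED
                || type == Character.SURROGATE) {
            return true;
        }
        // Control characters other than ordinary whitespace (tab, newline, ...).
        return type == Character.CONTROL && !Character.isWhitespace(codePoint);
    }

    /** Fraction of non-whitespace code points that look suspicious (0.0 for empty input). */
    static double suspiciousRatio(String text) {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            if (Character.isWhitespace(codePoint)) {
                continue; // whitespace says nothing about mapping quality
            }
            total++;
            if (isSuspicious(codePoint)) {
                suspicious++;
            }
        }
        return total == 0 ? 0.0 : (double) suspicious / total;
    }

    /** The threshold is an assumption to be tuned per corpus. */
    static boolean looksScrambled(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }

    public static void main(String[] args) {
        String sample = "ab\uFFFD\uE000";
        System.out.println(suspiciousRatio(sample));
    }
}
```

A caller could run looksScrambled on each page's extracted text and log or
reject pages above the chosen threshold.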