Marc Reichman created PDFBOX-6054:
-------------------------------------

             Summary: Enable API support to check when text is scrambled and/or if some of the unicode mapping warnings happen
                 Key: PDFBOX-6054
                 URL: https://issues.apache.org/jira/browse/PDFBOX-6054
             Project: PDFBox
          Issue Type: Improvement
          Components: Text extraction
    Affects Versions: 3.0.5 PDFBox
         Environment: Linux / JDK 21 / Docker; Windows / JDK 21
            Reporter: Marc Reichman
         Attachments: 7E32D4EAD8382000E24D9967C1913F6E.pdf

With the attached PDF, text extraction produces a lot of gibberish. I have seen other issues mention this, but in this particular case the document displays perfectly fine in Edge and Chrome. I have opened it in the PDF debugger, but it is hard to figure out what I'm looking at. The pdftotext tool from xpdf produces the same gibberish. Interestingly, the pdffonts tool does not flag any of the fonts as a problem.

I understand that this happens and that it is caused by PDF generation bugs, such as not including proper Unicode mappings. Still, I am curious: could we check a property, or get a specific exception, when a Unicode mapping is not available? I'm not sure whether that would be overcorrecting, i.e. whether Unicode mapping failures are simply a normal fact of life.
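
Until something like the requested property or exception exists, one possible caller-side workaround is to score the extracted text after the fact. The sketch below is only a heuristic and is not part of PDFBox: the class `ExtractionQuality`, its methods, and the threshold are hypothetical names chosen for illustration. It assumes that failed mappings tend to surface as U+FFFD, private-use, unassigned, or control code points, which will not catch text that is scrambled into otherwise valid letters.

```java
/**
 * Hypothetical helper (not a PDFBox API): estimates how much of an
 * already-extracted string consists of code points that commonly
 * appear when a font has no usable Unicode mapping.
 */
final class ExtractionQuality {

    private ExtractionQuality() {
    }

    /** Returns the fraction of non-whitespace code points that look suspicious. */
    static double suspiciousRatio(String text) {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // layout whitespace says nothing about mapping quality
            }
            total++;
            if (isSuspicious(cp)) {
                suspicious++;
            }
        }
        return total == 0 ? 0.0 : (double) suspicious / total;
    }

    /** Assumption: these categories rarely occur in correctly mapped text. */
    static boolean isSuspicious(int cp) {
        int type = Character.getType(cp);
        return cp == 0xFFFD                      // Unicode replacement character
                || type == Character.PRIVATE_USE // e.g. U+E000..U+F8FF
                || type == Character.UNASSIGNED
                || type == Character.CONTROL
                || type == Character.SURROGATE;  // unpaired surrogate
    }

    /** The threshold is an arbitrary, caller-chosen cut-off between 0 and 1. */
    static boolean looksScrambled(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }
}
```

A caller could pass the string returned by `PDFTextStripper.getText(document)` to `ExtractionQuality.looksScrambled(text, 0.3)` and fall back to OCR or flag the file for review when it returns true. The 0.3 value is a placeholder that would need tuning against real documents.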