Marc Reichman created PDFBOX-6054:
-------------------------------------

             Summary: Enable API support to check when text is scrambled and/or if some of the unicode mapping warnings happen
                 Key: PDFBOX-6054
                 URL: https://issues.apache.org/jira/browse/PDFBOX-6054
             Project: PDFBox
          Issue Type: Improvement
          Components: Text extraction
    Affects Versions: 3.0.5 PDFBox
         Environment: Linux / JDK 21 / Docker; Windows / JDK 21
            Reporter: Marc Reichman
         Attachments: 7E32D4EAD8382000E24D9967C1913F6E.pdf

With the attached PDF, text extraction produces plenty of gibberish. I have
seen other issues mention this, but in this particular case the file displays
perfectly fine in Edge and Chrome. I have opened it in the PDF debugger, but
it's hard to figure out what I'm looking at.

 

The pdftotext tool from xpdf produces the same gibberish. Interestingly, the
pdffonts tool does not flag any of the fonts as a "problem".

 

I understand this will happen and that it's due to PDF generation bugs, such
as not including proper Unicode mappings, etc., but I am curious: could we
check a property or get a specific exception when a Unicode mapping is not
available? I'm not sure if that would be overcorrecting, i.e. whether Unicode
mapping failures are simply a normal fact of life.
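
To illustrate the kind of signal I have in mind, here is a rough caller-side
sketch that scores a string that has already been extracted. It is not a PDFBox
API; the class and method names (SuspectTextCheck, suspectRatio, isSuspect) are
made up for this example. It only counts code points that typically show up
when no usable mapping exists (U+FFFD, private-use, unassigned, lone
surrogates, stray control characters). It would not catch the case where glyphs
map to valid but wrong letters, which may well be what happens in the attached
file.

```java
public class SuspectTextCheck {

    /**
     * Returns the fraction (0.0 to 1.0) of code points in the given text
     * that look like the result of a missing Unicode mapping.
     * Hypothetical helper, not part of PDFBox.
     */
    public static double suspectRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspect = 0;
        // Iterate by code point so supplementary characters count once.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (isSuspect(cp)) {
                suspect++;
            }
        }
        return (double) suspect / total;
    }

    /** Heuristic only: flags code points that rarely occur in real text. */
    static boolean isSuspect(int cp) {
        if (cp == 0xFFFD) {
            return true; // Unicode replacement character
        }
        int type = Character.getType(cp);
        if (type == Character.PRIVATE_USE
                || type == Character.UNASSIGNED
                || type == Character.SURROGATE) {
            return true;
        }
        // Control characters other than ordinary line breaks and tabs.
        return Character.isISOControl(cp)
                && cp != '\n' && cp != '\r' && cp != '\t';
    }

    public static void main(String[] args) {
        System.out.println(suspectRatio("Hello, world"));
        System.out.println(suspectRatio("\uE001\uE002ab"));
    }
}
```

A caller could run this over each page's extracted text and treat anything
above some threshold as suspicious; the right threshold is a guess and would
need tuning against real documents.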



