Bob Swanson created PDFBOX-4250:
-----------------------------------

             Summary: PDF File with embedded fonts: text extraction fails or 
returns junk characters
                 Key: PDFBOX-4250
                 URL: https://issues.apache.org/jira/browse/PDFBOX-4250
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.9
            Reporter: Bob Swanson


 One of the people I support created a PDF file from a LibreOffice 
document and then misplaced the original document. I believed that I could use 
PDFBox to extract the text from the PDF and at least provide that information 
to the user.



When I ran the text extractor from the "app" jar on their PDF file, I got many 
messages like the following:

...
Jun 13, 2018 5:38:43 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 7 (7) in font EXIRGE+Ubuntu
Jun 13, 2018 5:38:43 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 8 (8) in font EXIRGE+Ubuntu
Jun 13, 2018 5:38:43 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (1) in font JTPICY+AndaleMono
Jun 13, 2018 5:38:43 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
...

The resulting "txt" file contains only binary junk characters, except where the 
font is one of the "standard" fonts. I ran the debugger on the PDF file and saw 
that several fonts were embedded, and thus used low numbers for character codes 
(1, 2, 3, etc.).
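
To illustrate why a missing mapping would lead to junk output, here is a minimal 
Java sketch. It does not use PDFBox at all: the class name, the method names, and 
the code-to-Unicode table are hypothetical, and the fallback (emitting the raw 
code value as a character) is only my assumption about what happens when no 
mapping is found.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of decoding text drawn with a subset font; not PDFBox code.
class SubsetFontDecodeSketch {

    // Decode character codes with a hypothetical code -> Unicode table
    // (standing in for a ToUnicode CMap or a standard encoding).
    // ASSUMPTION: an unmapped code falls back to the raw code value as a char,
    // so codes 1, 2, 3 become the control characters U+0001, U+0002, U+0003.
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            String mapped = toUnicode.get(code);
            if (mapped != null) {
                out.append(mapped);
            } else {
                out.append((char) code);
            }
        }
        return out.toString();
    }

    // Count characters below U+0020 other than tab, LF and CR,
    // as a rough indicator of junk in extracted text.
    static int countControlChars(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] codes = {1, 2, 3}; // low codes, as a subset font might assign them

        Map<Integer, String> mapping = new HashMap<>();
        mapping.put(1, "F");
        mapping.put(2, "o");
        mapping.put(3, "x");

        String readable = decode(codes, mapping);
        String junk = decode(codes, new HashMap<Integer, String>());

        System.out.println("with mapping, control chars: " + countControlChars(readable));
        System.out.println("without mapping, control chars: " + countControlChars(junk));
    }
}
```

With the table present the same three codes decode to readable text; with an 
empty table every code comes out as a control character, which matches the kind 
of output I am seeing in the "txt" file.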



When viewed, the PDF file looks good, but nothing can be copied and pasted from 
the display (again, text in a standard font seems OK).



The original file was of a sensitive nature, so I re-created the problem with a 
simpler file.


Running on Ubuntu 16.04

LibreOffice was used to "print" to the cups-pdf "printer" (which may be part 
of the problem).



Text extraction was attempted with pdfbox-app-2.0.9.jar



PDF file is at:



http://swansongrp.com/misc/mytest3.pdf





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
