Alfons created PDFBOX-5540:
------------------------------

             Summary: export:text creates jibberish / malformed output
                 Key: PDFBOX-5540
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5540
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 3.0.0 JBIG2
         Environment: Same on Windows, Linux and macOS
            Reporter: Alfons
         Attachments: test.pdf, test.txt

I am using PDFBox as part of Tika, and some PDFs produce unreadable text 
output. Copying text from the same files in Adobe / macOS Preview / browsers 
works as expected.

I have also tried "re-encoding" the PDF by editing and saving it with Acrobat, 
in case the original PDF creator was at fault, and running PDFBox with 
different encodings, but the output remained mostly unchanged.

I attached the PDF and the text it produces. I ran PDFBox via the CLI as follows:
{code:java}
root % java -jar pdfbox-app-3.0.0-alpha3.jar export:text -i test.pdf          
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Invalid ToUnicode CMap in font 
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Using predefined identity CMap instead
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Invalid ToUnicode CMap in font 
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Using predefined identity CMap instead
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Invalid ToUnicode CMap in font 
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Using predefined identity CMap instead
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Invalid ToUnicode CMap in font 
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Using predefined identity CMap instead {code}
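
One possible reading of the warnings above, offered as a minimal sketch rather than a diagnosis of test.pdf: if the ToUnicode CMap is unusable and an identity mapping is used instead, each character code is presumably taken as its own Unicode code point. For a font whose codes are not Unicode values (for example a subset font), that would yield control characters and unrelated symbols instead of the visible text. The class name, the codes, and the mapping table below are hypothetical and chosen only for illustration; the code uses the Java standard library alone and does not call PDFBox.

```java
import java.util.Map;

public class IdentityFallbackDemo {

    // Decode character codes through a ToUnicode-style table (code -> Unicode text).
    public static String decodeWithToUnicode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            // Codes without a mapping become U+FFFD so gaps stay visible.
            sb.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    // Identity fallback (assumed behaviour): each code is used directly as a code point.
    public static String decodeWithIdentity(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.appendCodePoint(code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical codes a subset font might assign to the glyphs of "Test".
        int[] codes = {0x0003, 0x0011, 0x0024, 0x0025};
        Map<Integer, String> toUnicode =
                Map.of(0x0003, "T", 0x0011, "e", 0x0024, "s", 0x0025, "t");

        System.out.println("With ToUnicode table: " + decodeWithToUnicode(codes, toUnicode));

        // Print the fallback result as code points, since some are control characters.
        StringBuilder hex = new StringBuilder();
        decodeWithIdentity(codes).codePoints()
                .forEach(cp -> hex.append(String.format("U+%04X ", cp)));
        System.out.println("With identity fallback: " + hex.toString().trim());
    }
}
```

Viewers that copy the text correctly may be recovering it by other means (for instance from the embedded font program), which would be consistent with the output differing only in PDFBox.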


