Alfons created PDFBOX-5540:
------------------------------
Summary: export:text creates gibberish / malformed output
Key: PDFBOX-5540
URL: https://issues.apache.org/jira/browse/PDFBOX-5540
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 3.0.0 JBIG2
Environment: Same on Windows, Linux and macOS
Reporter: Alfons
Attachments: test.pdf, test.txt
I am using PDFBox as part of Tika, and some PDFs produce unreadable text
output. Copying text from the same files in Adobe / macOS Preview / browsers
works as expected.
I also tried "re-encoding" the PDF by editing and saving it with Acrobat, in
case the original PDF creator was at fault, and I tried running PDFBox with
different encodings, but the output remained mostly unchanged.
I attached the PDF and the text it produces. I ran PDFBox via the CLI as follows:
{code:java}
root % java -jar pdfbox-app-3.0.0-alpha3.jar export:text -i test.pdf
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Invalid ToUnicode CMap in font
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Using predefined identity CMap instead
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Invalid ToUnicode CMap in font
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Using predefined identity CMap instead
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Invalid ToUnicode CMap in font
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Using predefined identity CMap instead
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Invalid ToUnicode CMap in font
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Using predefined identity CMap instead {code}
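For context on the warnings above, here is a minimal sketch of why an identity fallback could yield unreadable text. It assumes the fallback treats each character code as the Unicode code point of the same value; that is my reading of the log message, not something confirmed against the PDFBox source. The class name, method names, and the tiny code table are hypothetical and only illustrate the idea, they are not PDFBox code.

```java
import java.util.Map;

public class IdentityFallbackDemo {
    // Hypothetical font-specific table: character code -> intended text.
    // A valid ToUnicode CMap would normally carry this kind of mapping.
    static final Map<Integer, String> TO_UNICODE = Map.of(
            0x01, "T", 0x02, "e", 0x03, "s", 0x04, "t");

    // Decode character codes using the font-specific table.
    static String decodeWithTable(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            // Unknown codes become the Unicode replacement character.
            sb.append(TO_UNICODE.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    // Assumed identity fallback: each code is used directly as a code point.
    static String decodeIdentity(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.appendCodePoint(code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = {0x01, 0x02, 0x03, 0x04};
        System.out.println(decodeWithTable(codes)); // prints "Test"

        // Show the identity result as code points, since they are not printable.
        StringBuilder hex = new StringBuilder();
        decodeIdentity(codes).codePoints()
                .forEach(cp -> hex.append(String.format("U+%04X ", cp)));
        System.out.println(hex.toString().trim()); // prints "U+0001 U+0002 U+0003 U+0004"
    }
}
```

If the fonts in test.pdf use subset-specific codes like this, the identity fallback would explain output that looks like control characters or unrelated symbols, while viewers that resolve the text another way still copy it correctly.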