[ https://issues.apache.org/jira/browse/PDFBOX-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr resolved PDFBOX-5540.
-------------------------------------
    Assignee: Tilman Hausherr
  Resolution: Fixed

> export:text creates jibberish / malformed output
> ------------------------------------------------
>
>                 Key: PDFBOX-5540
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5540
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.16, 2.0.27, 3.0.0 PDFBox
>         Environment: Same on Windows, Linux and macOS
>            Reporter: Alfons
>            Assignee: Tilman Hausherr
>            Priority: Minor
>              Labels: regression
>             Fix For: 2.0.28, 3.0.0 PDFBox
>
>         Attachments: PDFBOX-5540.pdf.txt, test.pdf, test.txt
>
>
> I am using PDFBox as part of Tika and am having issues with some PDFs
> producing unreadable output. Copying text from Adobe / macOS Preview /
> browsers works as expected.
> I have also tried "re-encoding" the PDF by editing and saving it with
> Acrobat, thinking it could be an issue with the original PDF creator, and
> I tried running PDFBox with different encodings, but the output remained
> mostly unchanged.
> I attached the PDF and the text it produces.
> Running PDFBox via the CLI as follows:
> {code:java}
> root % java -jar pdfbox-app-3.0.0-alpha3.jar export:text -i test.pdf
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)