[
https://issues.apache.org/jira/browse/PDFBOX-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631240#comment-17631240
]
Tilman Hausherr edited comment on PDFBOX-5540 at 11/10/22 7:16 AM:
-------------------------------------------------------------------
It worked with 2.0.15 and stopped working with 2.0.16. It's likely connected to
workarounds related to broken /ToUnicode streams.
Release notes of 2.0.16:
[https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310760&version=12345355]
The probable cause is PDFBOX-4550.
When I disable the (now very complex) workaround in
{{PDFont.loadUnicodeCmap()}}, text extraction works, so I guess that workaround
has to be fine-tuned once again.
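
To illustrate why a fallback to an identity mapping can produce unreadable text, here is a minimal, self-contained sketch. It is not PDFBox code: the class {{CMapFallbackSketch}}, its methods, and the sample table are hypothetical, and it assumes a subset font whose character codes 1, 2, 3 stand for "P", "D", "F".

```java
import java.util.Map;

// Toy illustration only; names and data are hypothetical, not PDFBox internals.
class CMapFallbackSketch {

    // Decode with a ToUnicode-style table: character code -> Unicode text.
    static String decodeWithToUnicode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            // In this sketch, codes missing from the table become U+FFFD.
            sb.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    // Identity fallback: the character code itself is used as the code point.
    static String decodeWithIdentity(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.appendCodePoint(code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed subset-font encoding: small codes assigned in order of use.
        Map<Integer, String> table = Map.of(1, "P", 2, "D", 3, "F");
        int[] codes = {1, 2, 3};

        System.out.println(decodeWithToUnicode(codes, table));

        // Print the identity result as code points, since it is not readable text.
        decodeWithIdentity(codes).codePoints()
                .forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}
```

With the table, the codes decode to readable text; with the identity mapping, the same codes become control characters. If a valid /ToUnicode stream is rejected and replaced by the identity mapping, output of this kind would be expected.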
> export:text creates gibberish / malformed output
> ------------------------------------------------
>
> Key: PDFBOX-5540
> URL: https://issues.apache.org/jira/browse/PDFBOX-5540
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 3.0.0 PDFBox
> Environment: Same on Windows, Linux and macOS
> Reporter: Alfons
> Priority: Minor
> Attachments: test.pdf, test.txt
>
>
> I am using PDFBox as part of Tika, and some PDFs produce unreadable output.
> Copying text from Adobe / macOS Preview / browsers works as expected.
> I have also tried "re-encoding" the PDF by editing and saving it with
> Acrobat, thinking it could be an issue with the original PDF creator, and
> running pdfbox with different encodings, but the output remained mostly
> unchanged.
> I attached the PDF and the text it produces. I am running PDFBox via the CLI
> as follows:
> {code:java}
> root % java -jar pdfbox-app-3.0.0-alpha3.jar export:text -i test.pdf
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead {code}