[ https://issues.apache.org/jira/browse/PDFBOX-5790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler resolved PDFBOX-5790.
----------------------------------------
Fix Version/s: 2.0.32, 4.0.0, 3.0.3 PDFBox
Resolution: Fixed
> Don't use a predefined CMap if a ToUnicode CMap is present
> ----------------------------------------------------------
>
> Key: PDFBOX-5790
> URL: https://issues.apache.org/jira/browse/PDFBOX-5790
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.31, 4.0.0, 3.0.3 PDFBox
> Reporter: Andreas Lehmkühler
> Assignee: Andreas Lehmkühler
> Priority: Major
> Fix For: 2.0.32, 4.0.0, 3.0.3 PDFBox
>
> Attachments: p4_fix.pdf
>
>
> The user Luiz Marcelo Modesto reported an issue with the text extraction of
> the attached pdf [^p4_fix.pdf]
> {quote}
> Hi everyone,
> I'm not sure if this is the same as FAQ "How come I am getting
> gibberish(G38G43G36G51G5) when extracting text?"...
> I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment (build
> 11.0.22+7-post-Ubuntu-0ubuntu222.04.1).
> I'm trying to understand how this PDF chunk (from p4_fix.pdf attached)
> BT
> /G1F7 6.0 Tf
> 94.871 773.806 Td
> <004200430044> Tj
> ET
> becomes "BCD" on PDFBox Debugger (the same on qpdfview, Adobe Reader,
> Chrome, ...) and becomes "abc" on PDFBox text extraction tool.
> Using the Poppler pdftotext (version 22.02.0) gives me "BCD" too.
> The renderers that allow me to copy the text give me "BCD" text.
> It seems that the PDFBox extraction tool follows section "9.10.2 Mapping
> character codes to Unicode values" (ISO 32000-2:2020), but all the others
> choose a different way.
> Could you help me to understand if there is a problem with the PDF file,
> with the renderers or with the text extraction tool?
> Thank you!
> {quote}
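
To illustrate the behaviour requested in the issue title, here is a minimal
sketch in plain Java. It is not the actual PDFBox change. It assumes that the
font uses two-byte character codes (suggested by the hex string
<004200430044>), and it models both CMaps as simple maps. The class name, the
method names and the two sample tables are hypothetical; the table contents
are chosen only to reproduce the "BCD" and "abc" results from the report.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ToUnicodePriorityDemo {

    // Hypothetical ToUnicode entries for font /G1F7 (assumed, matches the reported "BCD").
    static final Map<Integer, String> SAMPLE_TO_UNICODE =
            Map.of(0x0042, "B", 0x0043, "C", 0x0044, "D");

    // Hypothetical result of a predefined-CMap lookup (assumed, matches the reported "abc").
    static final Map<Integer, String> SAMPLE_PREDEFINED =
            Map.of(0x0042, "a", 0x0043, "b", 0x0044, "c");

    // Split a hex string such as "004200430044" into two-byte character codes.
    static List<Integer> splitCodes(String hex) {
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("expected a whole number of two-byte codes");
        }
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i < hex.length(); i += 4) {
            codes.add(Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return codes;
    }

    // If a ToUnicode map is present, use it exclusively; otherwise fall back
    // to the predefined map. Unmapped codes become U+FFFD (an assumption of
    // this sketch, not necessarily what PDFBox does).
    static String extract(String hex,
                          Map<Integer, String> toUnicode,
                          Map<Integer, String> predefined) {
        Map<Integer, String> source = (toUnicode != null) ? toUnicode : predefined;
        StringBuilder text = new StringBuilder();
        for (int code : splitCodes(hex)) {
            text.append(source.getOrDefault(code, "\uFFFD"));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // ToUnicode present: prints "BCD"
        System.out.println(extract("004200430044", SAMPLE_TO_UNICODE, SAMPLE_PREDEFINED));
        // ToUnicode absent: prints "abc"
        System.out.println(extract("004200430044", null, SAMPLE_PREDEFINED));
    }
}
```

With the ToUnicode table passed in, the sketch yields "BCD", which is what the
other viewers report; only when it is absent does the predefined table apply.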