[ https://issues.apache.org/jira/browse/PDFBOX-5790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler resolved PDFBOX-5790.
----------------------------------------
    Fix Version/s: 2.0.32
                   4.0.0
                   3.0.3 PDFBox
       Resolution: Fixed

> Don't use a predefined CMap if a ToUnicode CMap is present
> ----------------------------------------------------------
>
>                 Key: PDFBOX-5790
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5790
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.31, 4.0.0, 3.0.3 PDFBox
>            Reporter: Andreas Lehmkühler
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>             Fix For: 2.0.32, 4.0.0, 3.0.3 PDFBox
>
>         Attachments: p4_fix.pdf
>
>
> The user Luiz Marcelo Modesto reported an issue with the text extraction of
> the attached pdf [^p4_fix.pdf]
> {quote}
> Hi everyone,
> I'm not sure if this is the same as FAQ "How come I am getting
> gibberish(G38G43G36G51G5) when extracting text?"...
> I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment (build
> 11.0.22+7-post-Ubuntu-0ubuntu222.04.1).
> I'm trying to understand how this PDF chunk (from p4_fix.pdf attached)
> BT
> /G1F7 6.0 Tf
> 94.871 773.806 Td
> <004200430044> Tj
> ET
> becomes "BCD" on PDFBox Debugger (the same on qpdfview, Adobe Reader,
> Chrome, ...) and becomes "abc" on PDFBox text extraction tool.
> Using the Poppler pdftotext (version 22.02.0) gives me "BCD" too.
> The renders that allow me to copy the text give me "BCD" text.
> It seems that PDFBox extraction tool follows the item "9.10.2 Mapping
> character codes to Unicode values" (ISO 32000-2:2020) but all the others
> choose a different way.
> Could you help me to understand if there is a problem with the PDF file,
> with the renders or with the extract text tool?
> Thank you!
> {quote}
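
The lookup order named in the issue title can be illustrated with a minimal Java sketch. This is not PDFBox's actual code: the class `ToUnicodePrioritySketch`, its methods, and both mapping tables are hypothetical and were chosen only to reproduce the reported "BCD" versus "abc" symptom. The real CMaps inside p4_fix.pdf are not shown in the report. The sketch also assumes 2-byte character codes and that codes missing from a present ToUnicode CMap yield U+FFFD rather than falling back to the predefined CMap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not taken from the PDFBox sources.
class ToUnicodePrioritySketch {

    // Split the hex digits of a string operand such as <004200430044>
    // into 2-byte character codes (assumption: the font uses 2-byte codes).
    static List<Integer> splitTwoByteCodes(String hex) {
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            codes.add(Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return codes;
    }

    // Map codes to text. If a ToUnicode table is present (non-null), it is
    // the only table consulted; the predefined table is used only when no
    // ToUnicode table exists at all.
    static String toText(List<Integer> codes,
                         Map<Integer, String> toUnicode,
                         Map<Integer, String> predefined) {
        Map<Integer, String> table = (toUnicode != null) ? toUnicode : predefined;
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String mapped = table.get(code);
            // Assumption: unmapped codes become U+FFFD instead of being dropped.
            sb.append(mapped != null ? mapped : "\uFFFD");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up tables that mimic the symptom described in the report.
        Map<Integer, String> toUnicode = Map.of(0x42, "B", 0x43, "C", 0x44, "D");
        Map<Integer, String> predefined = Map.of(0x42, "a", 0x43, "b", 0x44, "c");

        List<Integer> codes = splitTwoByteCodes("004200430044");

        // ToUnicode present: it takes priority.
        System.out.println(toText(codes, toUnicode, predefined)); // prints "BCD"
        // No ToUnicode: fall back to the predefined table.
        System.out.println(toText(codes, null, predefined));      // prints "abc"
    }
}
```

With these made-up tables, consulting the predefined table first would produce "abc", while giving the ToUnicode table priority produces "BCD", which matches what the reporter saw in the other viewers and in Poppler's pdftotext.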