[jira] [Resolved] (PDFBOX-5790) Don't use a predefined CMap if a ToUnicode CMap is present

Jira Mon, 25 Mar 2024 11:36:55 -0700


     [ https://issues.apache.org/jira/browse/PDFBOX-5790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Andreas Lehmkühler resolved PDFBOX-5790.
----------------------------------------
    Fix Version/s: 2.0.32
                   4.0.0
                   3.0.3 PDFBox
       Resolution: Fixed

> Don't use a predefined CMap if a ToUnicode CMap is present
> ----------------------------------------------------------
>
>                 Key: PDFBOX-5790
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5790
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.31, 4.0.0, 3.0.3 PDFBox
>            Reporter: Andreas Lehmkühler
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>             Fix For: 2.0.32, 4.0.0, 3.0.3 PDFBox
>
>         Attachments: p4_fix.pdf
>
>
> The user Luiz Marcelo Modesto reported an issue with the text extraction of 
> the attached PDF [^p4_fix.pdf]:
> {quote}
> Hi everyone,
>     I'm not sure if this is the same as FAQ "How come I am getting 
> gibberish(G38G43G36G51G5) when extracting text?"...
>     I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment (build 
> 11.0.22+7-post-Ubuntu-0ubuntu222.04.1).
>     I'm trying to understand how this PDF chunk (from p4_fix.pdf attached)
>   BT
>   /G1F7 6.0 Tf
>   94.871 773.806 Td
>   <004200430044> Tj
>   ET
>     becomes "BCD" in PDFBox Debugger (the same in qpdfview, Adobe Reader, 
> Chrome, ...) and becomes "abc" in the PDFBox text extraction tool. 
>     Using the Poppler pdftotext (version 22.02.0) gives me "BCD" too.
>     The renderers that allow me to copy the text give me "BCD" as well.
>     It seems that the PDFBox extraction tool follows section "9.10.2 Mapping 
> character codes to Unicode values" (ISO 32000-2:2020), but all the others 
> choose a different way.
>      Could you help me understand whether the problem is with the PDF file, 
> with the renderers, or with the text extraction tool? 
> Thank you!
> {quote}
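
The precedence named in the issue title can be sketched roughly as follows. This is a minimal, self-contained illustration, not the PDFBox implementation: the class, the method names, and both mapping tables are hypothetical. The table contents are invented only to mirror the reported symptom ("BCD" versus "abc"); the actual CMaps in p4_fix.pdf are not reproduced here. The sketch also assumes 2-byte character codes and marks unmapped codes with U+FFFD.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names are PDFBox API.
class ToUnicodePrecedenceSketch {

    /** Splits a hex string such as "004200430044" into 2-byte character codes. */
    static int[] parseTwoByteCodes(String hex) {
        int[] codes = new int[hex.length() / 4];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = Integer.parseInt(hex.substring(4 * i, 4 * i + 4), 16);
        }
        return codes;
    }

    /**
     * Decodes character codes to text. If a ToUnicode table is present,
     * it is the only table consulted; the predefined table is used only
     * when no ToUnicode table exists at all.
     */
    static String decode(int[] codes,
                         Map<Integer, String> toUnicode,
                         Map<Integer, String> predefined) {
        Map<Integer, String> active = (toUnicode != null) ? toUnicode : predefined;
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String text = (active != null) ? active.get(code) : null;
            // Assumption of this sketch: unmapped codes become U+FFFD.
            sb.append(text != null ? text : "\uFFFD");
        }
        return sb.toString();
    }

    /** Invented ToUnicode-style table that yields "BCD" for the sample codes. */
    static Map<Integer, String> sampleToUnicode() {
        Map<Integer, String> m = new HashMap<>();
        m.put(0x0042, "B");
        m.put(0x0043, "C");
        m.put(0x0044, "D");
        return m;
    }

    /** Invented predefined-CMap-style table that yields "abc" for the same codes. */
    static Map<Integer, String> samplePredefined() {
        Map<Integer, String> m = new HashMap<>();
        m.put(0x0042, "a");
        m.put(0x0043, "b");
        m.put(0x0044, "c");
        return m;
    }

    public static void main(String[] args) {
        int[] codes = parseTwoByteCodes("004200430044"); // operand of Tj in the report
        // ToUnicode present: the predefined table is ignored.
        System.out.println(decode(codes, sampleToUnicode(), samplePredefined()));
        // No ToUnicode table: fall back to the predefined table.
        System.out.println(decode(codes, null, samplePredefined()));
    }
}
```

With the invented tables, the first call prints "BCD" and the second prints "abc", which matches the two behaviours described in the report.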



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

