[
https://issues.apache.org/jira/browse/PDFBOX-5790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andreas Lehmkühler updated PDFBOX-5790:
---------------------------------------
Description:
The user Luiz Marcelo Modesto reported an issue with the text extraction of the
attached pdf [^p4_fix.pdf]
{quote}
Hi everyone,
I'm not sure if this is the same as FAQ "How come I am getting
gibberish(G38G43G36G51G5) when extracting text?"...
I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment (build
11.0.22+7-post-Ubuntu-0ubuntu222.04.1).
I'm trying to understand how this PDF chunk (from p4_fix.pdf attached)
BT
/G1F7 6.0 Tf
94.871 773.806 Td
<004200430044> Tj
ET
becomes "BCD" in PDFBox Debugger (the same in qpdfview, Adobe Reader,
Chrome, ...) but becomes "abc" in the PDFBox text extraction tool.
Poppler's pdftotext (version 22.02.0) gives me "BCD" too.
The renderers that allow me to copy the text also give me "BCD".
It seems that the PDFBox extraction tool follows section "9.10.2 Mapping
character codes to Unicode values" (ISO 32000-2:2020), but all the others
choose a different approach.
Could you help me understand whether the problem is with the PDF file,
with the renderers, or with the text extraction tool?
Thank you!
{quote}
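
To make the report easier to follow, here is a minimal Python sketch of how
one hex string can yield two different texts depending on which mapping table
is consulted. It is an illustration only, not PDFBox code. It assumes that
font G1F7 uses two-byte character codes, and both mapping tables are invented
for this example; the actual CMaps in p4_fix.pdf are not shown in this
message. The names split_codes, to_text, TO_UNICODE and OTHER_TABLE are
hypothetical.

```python
def split_codes(hex_string, code_bytes=2):
    """Split a PDF hex string operand into fixed-width character codes."""
    digits = hex_string.strip("<>")
    width = code_bytes * 2  # two hex digits per byte
    return [int(digits[i:i + width], 16) for i in range(0, len(digits), width)]


def to_text(codes, to_unicode=None, fallback=None):
    """Use the ToUnicode table when one is present, else the fallback table."""
    table = to_unicode if to_unicode is not None else (fallback or {})
    # Codes without a mapping become U+FFFD (replacement character).
    return "".join(table.get(code, "\ufffd") for code in codes)


# Assumption: two-byte codes, so <004200430044> holds three codes.
codes = split_codes("<004200430044>")

# Hypothetical tables, chosen only to reproduce the two reported outputs.
TO_UNICODE = {0x0042: "B", 0x0043: "C", 0x0044: "D"}
OTHER_TABLE = {0x0042: "a", 0x0043: "b", 0x0044: "c"}

with_to_unicode = to_text(codes, TO_UNICODE, OTHER_TABLE)
without_to_unicode = to_text(codes, None, OTHER_TABLE)
```

Under these assumptions, with_to_unicode matches what the viewers show and
without_to_unicode matches the reported extraction result, which is the
precedence question raised by the issue title.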
> Don't use a predefined CMap if a ToUnicode CMap is present
> ----------------------------------------------------------
>
> Key: PDFBOX-5790
> URL: https://issues.apache.org/jira/browse/PDFBOX-5790
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.31, 4.0.0, 3.0.3 PDFBox
> Reporter: Andreas Lehmkühler
> Assignee: Andreas Lehmkühler
> Priority: Major
> Attachments: p4_fix.pdf
>