[
https://issues.apache.org/jira/browse/PDFBOX-3053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14972705#comment-14972705
]
Tilman Hausherr edited comment on PDFBOX-3053 at 10/24/15 4:57 PM:
-------------------------------------------------------------------
Extraction works with PDFBOX-2959-reduced.pdf because in that file the
character codes in the PDF are identical to the codes of the text itself. In
the "new" file, the codes are hex 1, 2, 3, 4 etc., and those raw codes are what
gets extracted. The encodings:
3053 file (the encoding is ignored):
{code}
/Differences [1 /space /A /n /a /d /u /l /t /s /h /o /g /e /m /r /c /y /p /i
/slash /f /colon /w /b /j /v /comma /k /period]
{code}
2959 file (codes map 1:1 to the text):
{code}
/Differences [32 /space 69 /E /F 72 /H /I 78 /N /O 82 /R 84 /T /U]
{code}
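To make the difference between the two arrays concrete, here is a minimal sketch of how a /Differences array assigns codes: a number sets the current code, and each glyph name that follows takes the next consecutive code. This is not PDFBox code. The helper names (parse_differences, decode) and the small glyph table are assumptions made up for illustration; the table covers only the glyph names that occur in the two arrays above.

```python
import re

# Hypothetical, tiny subset of a glyph-name table. Single-letter glyph names
# are treated as mapping to themselves.
GLYPH_TO_CHAR = {"space": " ", "slash": "/", "colon": ":", "comma": ",", "period": "."}

def parse_differences(body):
    """Return {code: glyph name} for the contents of a /Differences array."""
    mapping = {}
    code = None
    for token in re.findall(r"\d+|/[^\s/\[\]]+", body):
        if token.startswith("/"):
            if code is None:
                raise ValueError("glyph name appears before any starting code")
            mapping[code] = token[1:]
            code += 1  # following names take consecutive codes
        else:
            code = int(token)  # a number resets the current code
    return mapping

def decode(codes, mapping):
    """Translate character codes to text via the glyph names."""
    chars = []
    for c in codes:
        name = mapping[c]
        chars.append(GLYPH_TO_CHAR.get(name, name if len(name) == 1 else "?"))
    return "".join(chars)

DIFF_3053 = parse_differences(
    "1 /space /A /n /a /d /u /l /t /s /h /o /g /e /m /r /c /y /p /i "
    "/slash /f /colon /w /b /j /v /comma /k /period"
)
DIFF_2959 = parse_differences(
    "32 /space 69 /E /F 72 /H /I 78 /N /O 82 /R 84 /T /U"
)

# 3053 file: the codes are 1, 2, 3, ... so the encoding must be applied.
codes_3053 = [2, 3, 5]
with_encoding = decode(codes_3053, DIFF_3053)        # "And"
raw_3053 = bytes(codes_3053).decode("latin-1")       # control characters

# 2959 file: each code equals the character's own code, so applying or
# ignoring the encoding gives the same text.
codes_2959 = [72, 69, 82, 69]
same_either_way = decode(codes_2959, DIFF_2959) == bytes(codes_2959).decode("latin-1")
```

With the 3053 array, codes 2, 3, 5 only become "And" when the encoding is applied; taken raw they are control characters. With the 2959 array the result is the same either way, which would explain why that file extracts correctly even when the encoding is ignored.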
was (Author: tilman):
The reason that it works with the PDFBOX-2959-reduced.pdf is that this one has
the character codes in the PDF identical to the text. In the "new" file, the
codes are hex 1, 2, 3, 4 etc and that is what is extracted. Encoding:
3053 file:
{code}
/Differences [1 /space /A /n /a /d /u /l /t /s /h /o /g /e /m /r /c /y /p /i
/slash /f /colon /w /b /j /v /comma /k /period]
{code}
2959 file:
{code}
/Differences [32 /space 69 /E /F 72 /H /I 78 /N /O 82 /R 84 /T /U]
{code}
> Text extraction fails with type 3 fonts
> ---------------------------------------
>
> Key: PDFBOX-3053
> URL: https://issues.apache.org/jira/browse/PDFBOX-3053
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Tilman Hausherr
> Labels: type3
> Attachments: PDFBOX-2959-reduced.pdf,
> PDFBOX-3053-3YQ2UXRQBBLX5TLKSLFCUZLWXWSI2Z2U.pdf, PDFBOX-3053-reduced.pdf
>
>
> Text extraction fails with the attached file. It succeeds with Acrobat
> Reader, with PDF.js and with PDFBox 1.8.
> This is not a general type 3 problem. Text extraction works with
> PDFBOX-2959-reduced.pdf.