[ 
https://issues.apache.org/jira/browse/PDFBOX-3053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14972705#comment-14972705
 ] 

Tilman Hausherr edited comment on PDFBOX-3053 at 10/24/15 4:57 PM:
-------------------------------------------------------------------

Extraction works with PDFBOX-2959-reduced.pdf because the character codes in 
that PDF are identical to the character values of the text. In the "new" file, 
the codes are hex 1, 2, 3, 4, etc., and those raw codes are what is extracted. 
Encodings:

3053 file (this encoding is ignored):
{code}
/Differences [1 /space /A /n /a /d /u /l /t /s /h /o /g /e /m /r /c /y /p /i 
/slash /f /colon /w /b /j /v /comma /k /period]
{code}

2959 file (codes map 1:1 to the text):
{code}
/Differences [32 /space 69 /E /F 72 /H /I 78 /N /O 82 /R 84 /T /U]
{code}
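
To make the difference between the two arrays concrete, here is a minimal Python sketch of how a /Differences array expands into a code-to-glyph-name map. It is not PDFBox code: the function names are hypothetical, and the small glyph table is only a stand-in for the Adobe Glyph List, sufficient for the glyph names in these two files.

```python
def parse_differences(array_text):
    """Expand the contents of a /Differences array into {code: glyph name}.

    A number sets the next character code; each following /name is
    assigned to consecutive codes.
    """
    mapping = {}
    code = None
    for token in array_text.split():
        if token.startswith("/"):
            if code is None:
                raise ValueError("glyph name before any starting code")
            mapping[code] = token[1:]
            code += 1
        else:
            code = int(token)
    return mapping


# Assumption: a tiny stand-in for the Adobe Glyph List, covering only
# the multi-letter glyph names that occur in these two arrays.
GLYPH_TO_TEXT = {"space": " ", "slash": "/", "colon": ":",
                 "comma": ",", "period": "."}


def glyph_to_text(name):
    """Single-letter glyph names such as /A stand for themselves."""
    if name in GLYPH_TO_TEXT:
        return GLYPH_TO_TEXT[name]
    return name if len(name) == 1 else "?"


def decode(codes, differences):
    """Translate character codes to text through the glyph names."""
    return "".join(glyph_to_text(differences[c]) for c in codes)


DIFF_3053 = parse_differences(
    "1 /space /A /n /a /d /u /l /t /s /h /o /g /e /m /r /c /y /p /i "
    "/slash /f /colon /w /b /j /v /comma /k /period")
DIFF_2959 = parse_differences(
    "32 /space 69 /E /F 72 /H /I 78 /N /O 82 /R 84 /T /U")

# 2959 file: every code already equals the character value of its glyph,
# so extraction gives the right text even if the encoding is ignored.
codes_match_text = all(chr(c) == glyph_to_text(n)
                       for c, n in DIFF_2959.items())

# 3053 file: the codes 2, 3, 5 only become "And" through the encoding.
with_encoding = decode([2, 3, 5], DIFF_3053)
without_encoding = "".join(chr(c) for c in [2, 3, 5])
```

Under these assumptions, `codes_match_text` is true for the 2959 array, while for the 3053 array the same codes yield readable text only when the glyph names are consulted.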



> Text extraction fails with type 3 fonts
> ---------------------------------------
>
>                 Key: PDFBOX-3053
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3053
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Tilman Hausherr
>              Labels: type3
>         Attachments: PDFBOX-2959-reduced.pdf, 
> PDFBOX-3053-3YQ2UXRQBBLX5TLKSLFCUZLWXWSI2Z2U.pdf, PDFBOX-3053-reduced.pdf
>
>
> Text extraction fails with the attached file. It succeeds with Acrobat 
> Reader, with PDF.js and with PDFBox 1.8.
> This is not a general type 3 problem. Text extraction works with 
> PDFBOX-2959-reduced.pdf.



