[ 
https://issues.apache.org/jira/browse/PDFBOX-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232503#comment-14232503
 ] 

John Hewson commented on PDFBOX-2532:
-------------------------------------

{quote}
The text is already readable and gets scrambled by using the embedded encoding.
{quote}

I don't see any readable text. The first text-drawing operation in the content 
stream is:

{code}
[ (7) -19.7 (>) 25.3 (P) -19.7 (F) -19.7 (L) 14.3 (K) 5.3 (>) -19.7 (I) -19.7
( 9) -19.7 (>) 10.3 (N) -19.7 (H) -19.7 ( ) -25 (;) -19.7 (B) 25.3 (N) -19.7 (R)
-19.7 (F) 20.3 (@) -19.7 (B) ] TJ
{code}

This corresponds to the string {{7>PFLK>I 9>NH ;BNRF@B}}. The font's Encoding 
is given as MacRomanEncoding, which is identical to ASCII for codes below 128. 
This also matches the mapping given in the embedded Type1C font's Encoding.
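
To make the correspondence easy to reproduce, here is a minimal Python sketch (not PDFBox code) that concatenates the string operands of the TJ array above and drops the numeric position adjustments. The helper name {{tj_strings}} is hypothetical, and the regex assumes the strings contain no escapes or nested parentheses, which holds for this operand but not for PDF strings in general. Python's {{mac_roman}} codec is used only as a stand-in for MacRomanEncoding in the ASCII range.

```python
import re

# TJ operand copied from the content stream quoted above; the " 9" string is
# kept on one line so that no line break falls inside a string literal.
TJ_OPERAND = (
    "[ (7) -19.7 (>) 25.3 (P) -19.7 (F) -19.7 (L) 14.3 (K) 5.3 (>) -19.7 (I) -19.7 "
    "( 9) -19.7 (>) 10.3 (N) -19.7 (H) -19.7 ( ) -25 (;) -19.7 (B) 25.3 (N) -19.7 (R) "
    "-19.7 (F) 20.3 (@) -19.7 (B) ]"
)


def tj_strings(operand: str) -> str:
    """Join the literal strings of a TJ array, ignoring the numeric adjustments.

    Simplification (assumption): no backslash escapes or nested parentheses.
    """
    return "".join(re.findall(r"\(([^()\\]*)\)", operand))


text = tj_strings(TJ_OPERAND)
print(text)

# For these codes (all below 128) the Mac Roman and ASCII byte values agree,
# so interpreting the bytes with either encoding yields the same string.
same_in_both = text.encode("mac_roman") == text.encode("ascii")
print(same_in_both)
```

With the operand above, the joined string is the one quoted in this comment, so applying the declared encoding still yields the scrambled-looking text rather than anything readable.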

> Text extraction fails due to the usage of the internal font mapping
> -------------------------------------------------------------------
>
>                 Key: PDFBOX-2532
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2532
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Andreas Lehmkühler
>             Fix For: 2.0.0
>
>         Attachments: PDFBOX2247-701542.pdf, PDFBOX2247-701542_cp_acrobat.txt, 
> PDFBOX2247-701542_sa_acrobat.txt, PDFBOX2247-701542_sa_acrobat_osx.txt, 
> PDFBOX2247-701542_sa_reader_osx.txt
>
>
> If a PDF doesn't provide any mapping (neither an encoding nor a ToUnicode 
> mapping), we have to decide where to get a suitable mapping ourselves. We 
> can't use the internal font mapping of the Type1C font, as it doesn't work in 
> every case; see PDFBOX-2377.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
