[
https://issues.apache.org/jira/browse/PDFBOX-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232503#comment-14232503
]
John Hewson commented on PDFBOX-2532:
-------------------------------------
{quote}
The text is already readable and gets scrambled by using the embedded encoding.
{quote}
I don't see any readable text. The first text drawing operation in the content
stream is:
{code}
[ (7) -19.7 (>) 25.3 (P) -19.7 (F) -19.7 (L) 14.3 (K) 5.3 (>) -19.7 (I) -19.7
( 9) -19.7 (>) 10.3 (N) -19.7 (H) -19.7 ( ) -25 (;) -19.7 (B) 25.3 (N) -19.7 (R)
-19.7 (F) 20.3 (@) -19.7 (B) ] TJ
{code}
This corresponds to the string {{7>PFLK>I 9>NH ;BNRF@B}}. The font's Encoding
is given as MacRomanEncoding, which is the same as ASCII for codes below 128.
This also matches the mapping given in the embedded Type1C font's Encoding.
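
To make that reading reproducible, here is a minimal sketch that concatenates the literal strings of the TJ array and skips the kerning numbers. {{TjArrayStrings}} and its methods are hypothetical helpers, not PDFBox API. The sketch assumes the operand contains only literal strings (no hex strings, no octal escapes), and that the line break inside the ninth string is a mail-wrapping artifact, so the string is {{( 9)}}, which is what the resulting text suggests.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of PDFBox: reads the literal strings of a TJ array. */
class TjArrayStrings {

    /** The TJ operand quoted in this comment, with the wrapped "( 9)" rejoined (assumption). */
    static final String SAMPLE =
        "[ (7) -19.7 (>) 25.3 (P) -19.7 (F) -19.7 (L) 14.3 (K) 5.3 (>) -19.7 (I) -19.7 "
        + "( 9) -19.7 (>) 10.3 (N) -19.7 (H) -19.7 ( ) -25 (;) -19.7 (B) 25.3 (N) -19.7 (R) "
        + "-19.7 (F) 20.3 (@) -19.7 (B) ] TJ";

    /**
     * Returns the contents of every (...) literal in order. Everything outside
     * a literal (kerning numbers, brackets, the operator) is skipped.
     * Simplification: a backslash just protects the next character.
     */
    static List<String> literals(String tjOperand) {
        List<String> out = new ArrayList<>();
        StringBuilder current = null; // non-null while inside a literal
        int depth = 0;                // nesting depth of unescaped parentheses
        for (int i = 0; i < tjOperand.length(); i++) {
            char c = tjOperand.charAt(i);
            if (current == null) {
                if (c == '(') {
                    current = new StringBuilder();
                    depth = 1;
                }
                continue;
            }
            if (c == '\\' && i + 1 < tjOperand.length()) {
                current.append(tjOperand.charAt(++i));
            } else if (c == '(') {
                depth++;
                current.append(c);
            } else if (c == ')' && --depth == 0) {
                out.add(current.toString());
                current = null;
            } else {
                current.append(c); // ordinary character, or a nested ')'
            }
        }
        return out;
    }

    /** Concatenates all literals into the string the operator draws. */
    static String joined(String tjOperand) {
        return String.join("", literals(tjOperand));
    }

    /** True if every character code is below 128, where MacRomanEncoding matches ASCII. */
    static boolean allBelow128(String s) {
        return s.chars().allMatch(ch -> ch < 128);
    }

    public static void main(String[] args) {
        String text = joined(SAMPLE);
        System.out.println(text);
        System.out.println(allBelow128(text));
    }
}
```

All codes in the result are below 128, so under the assumptions above the MacRomanEncoding lookup behaves like plain ASCII for this operand.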
> Text extraction fails due to the usage of the internal font mapping
> -------------------------------------------------------------------
>
> Key: PDFBOX-2532
> URL: https://issues.apache.org/jira/browse/PDFBOX-2532
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Andreas Lehmkühler
> Fix For: 2.0.0
>
> Attachments: PDFBOX2247-701542.pdf, PDFBOX2247-701542_cp_acrobat.txt,
> PDFBOX2247-701542_sa_acrobat.txt, PDFBOX2247-701542_sa_acrobat_osx.txt,
> PDFBOX2247-701542_sa_reader_osx.txt
>
>
> If a PDF doesn't provide any mapping (neither an encoding nor a ToUnicode
> mapping), we have to decide where to get a suitable mapping ourselves. We
> can't use the internal font mapping of the Type1C font as it doesn't work in
> every case; see PDFBOX-2377.