[
https://issues.apache.org/jira/browse/PDFBOX-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233582#comment-14233582
]
Andreas Lehmkühler commented on PDFBOX-2532:
--------------------------------------------
{quote}
Any thoughts on how Acrobat is able to extract the text?
{quote}
I guess Acrobat does the same as the 1.8 branch: it simply ignores the
embedded encoding. The question is how to detect such broken fonts. It looks
like the good ones have a CharSet entry in the font descriptor and the bad ones
do not, maybe because that information can't be built due to the broken
encoding. The symbolic flag is no help. The PDF font doesn't provide any
encoding that would indicate that it uses some non-standard encoding.
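
To make the idea concrete, here is a rough sketch of that check. It is not
PDFBox code: the font dictionary is modelled as a plain java.util.Map, and the
class and method names are made up for illustration. Only the keys (Encoding,
ToUnicode, FontDescriptor, CharSet) are the usual PDF dictionary entries.
Treating a missing CharSet as "broken" is just the observation from the
attached sample, not a confirmed rule.

```java
import java.util.Map;

// Hypothetical helper, not part of PDFBox. A font dictionary is modelled
// as a Map from PDF key names (without the leading slash) to values.
final class EmbeddedEncodingHeuristic {

    private EmbeddedEncodingHeuristic() {
    }

    // True if the PDF itself supplies a mapping, i.e. an Encoding entry
    // or a ToUnicode CMap in the font dictionary.
    static boolean hasPdfLevelMapping(Map<String, Object> fontDict) {
        return fontDict.containsKey("Encoding") || fontDict.containsKey("ToUnicode");
    }

    // True if the font descriptor carries a non-empty CharSet string.
    static boolean hasCharSet(Map<String, Object> fontDict) {
        Object descriptor = fontDict.get("FontDescriptor");
        if (!(descriptor instanceof Map)) {
            return false;
        }
        Object charSet = ((Map<?, ?>) descriptor).get("CharSet");
        return charSet instanceof String && !((String) charSet).isEmpty();
    }

    // Assumption from the comment above: with no PDF-level mapping and no
    // CharSet, the embedded encoding is suspect and should be ignored.
    static boolean shouldIgnoreEmbeddedEncoding(Map<String, Object> fontDict) {
        return !hasPdfLevelMapping(fontDict) && !hasCharSet(fontDict);
    }
}
```

A font with a ToUnicode entry is never flagged here, since in that case the
embedded encoding would not be consulted for text extraction anyway.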
> Text extraction fails due to the usage of the internal font mapping
> -------------------------------------------------------------------
>
> Key: PDFBOX-2532
> URL: https://issues.apache.org/jira/browse/PDFBOX-2532
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Andreas Lehmkühler
> Fix For: 2.0.0
>
> Attachments: PDFBOX2247-701542.pdf, PDFBOX2247-701542_cp_acrobat.txt,
> PDFBOX2247-701542_sa_acrobat.txt, PDFBOX2247-701542_sa_acrobat_osx.txt,
> PDFBOX2247-701542_sa_reader_osx.txt, PDFBOX2247-Debugger.png
>
>
> If a PDF doesn't provide any mapping (neither an encoding nor a ToUnicode
> mapping), we have to decide where to get a suitable mapping ourselves. We
> can't use the internal font mapping of the Type 1C font, as it doesn't work in
> every case; see PDFBOX-2377, which provides a solution for the 1.8 branch.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)