[ 
https://issues.apache.org/jira/browse/PDFBOX-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233582#comment-14233582
 ] 

Andreas Lehmkühler commented on PDFBOX-2532:
--------------------------------------------

{quote}
Any thoughts on how Acrobat is able to extract the text?
{quote}
I guess Acrobat does the same as the 1.8 branch: it simply ignores the 
embedded encoding. The question is how to detect such broken fonts. It looks 
like the good ones have a CharSet entry in the font descriptor and the bad ones 
do not, maybe because that information can't be built when the encoding is 
broken. The symbolic flag is no help. The PDF font itself doesn't provide any 
encoding that would indicate it uses some non-standard encoding.
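
To make the idea concrete, here is a rough sketch of that check. It is not PDFBox code: plain maps stand in for the font dictionary and the font descriptor, and the class and method names are made up for illustration. It also assumes that a missing CharSet really does correlate with a broken embedded encoding, which so far is only an observation from the attached sample.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox. The maps are stand-ins for the
// PDF font dictionary and its font descriptor dictionary.
final class BrokenFontHeuristic {

    private BrokenFontHeuristic() {
    }

    /**
     * Returns true if the embedded (internal) font encoding should be ignored.
     * Assumption: a font with neither /Encoding nor /ToUnicode whose descriptor
     * also lacks a non-empty /CharSet is treated as one of the "bad" fonts.
     */
    static boolean shouldIgnoreEmbeddedEncoding(Map<String, Object> fontDict,
                                                Map<String, Object> fontDescriptor) {
        // If the PDF supplies its own mapping, there is nothing to guess.
        boolean hasPdfMapping = fontDict.containsKey("Encoding")
                || fontDict.containsKey("ToUnicode");
        if (hasPdfMapping) {
            return false;
        }
        // The "good" fonts seen so far carry a CharSet string in the descriptor.
        Object charSet = fontDescriptor.get("CharSet");
        boolean hasCharSet = charSet instanceof String && !((String) charSet).isEmpty();
        return !hasCharSet;
    }

    public static void main(String[] args) {
        Map<String, Object> fontDict = new HashMap<>();
        Map<String, Object> descriptor = new HashMap<>();
        System.out.println("without CharSet: "
                + shouldIgnoreEmbeddedEncoding(fontDict, descriptor));
        descriptor.put("CharSet", "/a/b/c");
        System.out.println("with CharSet: "
                + shouldIgnoreEmbeddedEncoding(fontDict, descriptor));
    }
}
```

Whether this holds for other files is an open question; the check would only be a fallback for fonts where the PDF provides no mapping at all.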

> Text extraction fails due to the usage of the internal font mapping
> -------------------------------------------------------------------
>
>                 Key: PDFBOX-2532
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2532
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Andreas Lehmkühler
>             Fix For: 2.0.0
>
>         Attachments: PDFBOX2247-701542.pdf, PDFBOX2247-701542_cp_acrobat.txt, 
> PDFBOX2247-701542_sa_acrobat.txt, PDFBOX2247-701542_sa_acrobat_osx.txt, 
> PDFBOX2247-701542_sa_reader_osx.txt, PDFBOX2247-Debugger.png
>
>
> If a PDF doesn't provide any mapping (neither an encoding nor a ToUnicode 
> mapping), we have to decide where to get a suitable mapping ourselves. We 
> can't use the internal font mapping of the Type1C font, as it doesn't work in 
> every case; see PDFBOX-2377, which provides a solution for the 1.8 branch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
