[jira] [Commented] (PDFBOX-2532) Text extraction fails due to the usage of the internal font mapping

JIRA Mon, 08 Dec 2014 13:36:13 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238494#comment-14238494
 ]


Andreas Lehmkühler commented on PDFBOX-2532:
--------------------------------------------

{quote}
It's very common to need to extract the Encoding from Type1C fonts, so Acrobat 
must be doing something other than just ignoring the encoding. Either it's a 
bug in Acrobat (which happens to produce good behaviour for this file) or they 
have some sort of heuristic.
{quote}
It has to be a new bug, as it worked with older Acrobat versions.

{quote}
The CharSet entry can't be the deciding factor, because it is optional, and its 
entries are unordered, so it provides no help in identifying a "jumbled" 
encoding. Two different encodings which contain the same characters will have 
the same CharSet, even if their order is different.
{quote}
I know the specs. Still, in every case I have seen, it was a good indicator of 
a broken font.
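
To illustrate the quoted point about CharSet, here is a minimal sketch in plain Java. It does not use the PDFBox API; the class `CharSetDemo` and the helpers `encoding` and `charSet` are hypothetical names made up for this example, and the glyph names are arbitrary. It only shows that two encodings with the same glyphs in a different order reduce to the same unordered set, so the set alone cannot tell a correct encoding from a "jumbled" one.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class CharSetDemo {
    // Hypothetical stand-in for a font encoding: character code -> glyph name.
    // Codes are assigned 1, 2, 3, ... in the order the names are given.
    static Map<Integer, String> encoding(String... glyphNames) {
        Map<Integer, String> map = new LinkedHashMap<>();
        for (int i = 0; i < glyphNames.length; i++) {
            map.put(i + 1, glyphNames[i]);
        }
        return map;
    }

    // CharSet analogue: only the glyph names, with no codes and no order.
    static Set<String> charSet(Map<Integer, String> encoding) {
        return new TreeSet<>(encoding.values());
    }

    public static void main(String[] args) {
        Map<Integer, String> intended = encoding("A", "B", "C");
        Map<Integer, String> jumbled = encoding("C", "A", "B");

        // Same glyphs, so the unordered sets are equal.
        System.out.println(charSet(intended).equals(charSet(jumbled))); // true
        // Different code assignments, so the encodings themselves differ.
        System.out.println(intended.equals(jumbled)); // false
    }
}
```

This says nothing about whether a CharSet entry is a useful hint that a font is broken in practice; it only shows why it cannot identify the order of an encoding.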


> Text extraction fails due to the usage of the internal font mapping
> -------------------------------------------------------------------
>
>                 Key: PDFBOX-2532
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2532
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Andreas Lehmkühler
>             Fix For: 2.0.0
>
>         Attachments: PDFBOX2247-701542.pdf, PDFBOX2247-701542_cp_acrobat.txt, 
> PDFBOX2247-701542_sa_acrobat.txt, PDFBOX2247-701542_sa_acrobat_osx.txt, 
> PDFBOX2247-701542_sa_reader_osx.txt, PDFBOX2247-Debugger.png
>
>
> If a PDF doesn't provide any mapping (neither an encoding nor a ToUnicode 
> mapping), we have to decide for ourselves where to get a suitable mapping. We 
> can't use the internal font mapping of the Type1C font, as it doesn't work in 
> every case; see PDFBOX-2377, which provides a solution for the 1.8 branch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
