[ 
https://issues.apache.org/jira/browse/PDFBOX-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228889#comment-14228889
 ] 

Andreas Lehmkühler commented on PDFBOX-2377:
--------------------------------------------

I think I've found a solution for the 1.8 branch. First of all, I've reverted 
my former changes.

The problem is that the PDF doesn't provide any mapping (neither an encoding 
nor a ToUnicode mapping), so we have to decide ourselves where to get a 
suitable mapping. The former code uses the font's internal encoding as a 
fallback in such cases, but that doesn't work in every case (see 
PDFBOX-2247). It looks like it depends on whether a charset exists in the 
font descriptor: we can rely on the internal mapping if a charset is defined; 
otherwise we just map the byte(s) to a string.
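
As a rough illustration of the decision described above, here is a minimal 
sketch. It is not the actual PDFBox code: the class name, the method 
signature, the map standing in for the font's internal encoding, and the 
choice of ISO-8859-1 for the plain byte-to-string mapping are all assumptions 
made for the example. Falling back to the byte mapping when the internal 
encoding has no entry is also an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical sketch, not PDFBox API: models a font for which the PDF
// supplies neither an encoding nor a ToUnicode mapping.
class FallbackMappingSketch {

    /**
     * @param code                 raw character code bytes from the content stream
     * @param descriptorHasCharset whether the font descriptor defines a charset
     * @param internalFontEncoding stand-in for the font's internal encoding
     *                             (character code -> text), hypothetical
     */
    static String decode(byte[] code, boolean descriptorHasCharset,
                         Map<Integer, String> internalFontEncoding) {
        if (descriptorHasCharset) {
            // A charset is defined, so trust the font's internal mapping.
            int key = 0;
            for (byte b : code) {
                key = (key << 8) | (b & 0xFF);
            }
            String mapped = internalFontEncoding.get(key);
            if (mapped != null) {
                return mapped;
            }
            // Assumption: no entry found, fall through to the byte mapping.
        }
        // No charset: map the byte(s) directly to a string.
        // Assumption: ISO-8859-1, which maps each byte to one character.
        return new String(code, StandardCharsets.ISO_8859_1);
    }
}
```

With an internal encoding that maps 0x41 to "S", the byte 0x41 would decode 
to "S" when a charset is defined and to "A" when it is not.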

IMHO we have to adapt that solution for the trunk as well.

> Apparent regression in character mapping in a few files from govdocs1
> ---------------------------------------------------------------------
>
>                 Key: PDFBOX-2377
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2377
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7, 2.0.0
>            Reporter: Tim Allison
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>              Labels: regression
>             Fix For: 1.8.8, 2.0.0
>
>         Attachments: 290991-6.txt, 290991-7.txt, 290991-8.txt, 290991.pdf, 
> 312888.pdf, 357094-1.8.6.txt, 357094-1.8.8.txt, 357094.pdf, 764929.pdf, 
> PDFBOX2247-701542.pdf
>
>
> On a small number of test files in a 50k sample of PDFs from govdocs1, it 
> appears that some characters are no longer being extracted correctly in 1.8.7 
> when compared to 1.8.6.  I ran PDFBox's app.jar with ExtractText:
> {noformat}
> 764929.pdf
> 1.8.6: Lang, Astrophysical Data: Planets and Stars
> 1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
> {noformat}
> and
> {noformat}
> 312888.pdf
> 1.8.6: Self-Assessment & Capability Description
> 1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
> {noformat}



