[jira] [Comment Edited] (PDFBOX-2377) Apparent regression in character mapping in a few files from govdocs1

John Hewson (JIRA) Wed, 24 Sep 2014 11:39:54 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146689#comment-14146689
 ]


John Hewson edited comment on PDFBOX-2377 at 9/24/14 6:38 PM:
--------------------------------------------------------------

Looking at the commits from PDFBOX-2247, the use of FirstChar is not correct. 
FirstChar should not be involved in encoding; it should be used only for 
retrieving glyph widths from the Widths array. The "charOffset" variable isn't 
needed, and the "code" variable in getCharacter() should be left un-tweaked.
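
To illustrate the distinction, here is a minimal sketch. It is not the 
PDFBox code: the class SimpleFontSketch, its fields, and its constructor are 
hypothetical stand-ins for a simple single-byte font, and only the method 
names getCharacter() and a width lookup echo the discussion above. It assumes 
the usual PDF rule that the Widths array starts at FirstChar and that codes 
outside the array fall back to MissingWidth.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a simple (single-byte) PDF font; not PDFBox code. */
final class SimpleFontSketch {
    private final int firstChar;                  // /FirstChar from the font dictionary
    private final float[] widths;                 // /Widths, whose entry 0 belongs to FirstChar
    private final float missingWidth;             // /MissingWidth from the font descriptor
    private final Map<Integer, String> encoding;  // character code -> extracted text

    SimpleFontSketch(int firstChar, float[] widths, float missingWidth,
                     Map<Integer, String> encoding) {
        this.firstChar = firstChar;
        this.widths = widths.clone();
        this.missingWidth = missingWidth;
        this.encoding = new HashMap<>(encoding);
    }

    /** FirstChar is applied here, and only here, to index the Widths array. */
    float getWidth(int code) {
        int index = code - firstChar;
        if (index < 0 || index >= widths.length) {
            return missingWidth; // code lies outside FirstChar..LastChar
        }
        return widths[index];
    }

    /** The code is looked up as-is: no FirstChar offset is added or subtracted. */
    String getCharacter(int code) {
        return encoding.get(code);
    }

    public static void main(String[] args) {
        Map<Integer, String> enc = new HashMap<>();
        enc.put(65, "A");
        enc.put(66, "B");
        SimpleFontSketch font =
                new SimpleFontSketch(65, new float[] {722f, 667f}, 0f, enc);
        // Same code, two independent lookups: text via the encoding,
        // width via the FirstChar-relative index.
        System.out.println(font.getCharacter(66) + " " + font.getWidth(66));
    }
}
```

If the code were shifted by FirstChar before the encoding lookup, every 
character would resolve to a different entry in the encoding, which is 
consistent with the kind of scrambled text reported in this issue.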

It looks like PDType1CFont#getFontWidth() isn't using the Widths array at all, 
which might be the cause of some problems. Or there may be deeper encoding 
issues with 1.8; the original issue was PDFBOX-2058. None of this applies to 
the trunk anymore.


> Apparent regression in character mapping in a few files from govdocs1
> ---------------------------------------------------------------------
>
>                 Key: PDFBOX-2377
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2377
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7
>            Reporter: Tim Allison
>            Priority: Minor
>              Labels: regression
>         Attachments: 312888.pdf, 764929.pdf
>
>
> On a small number of test files in a 50k sample of PDFs from govdocs1, it 
> appears that some characters are no longer being extracted correctly in 1.8.7 
> when compared to 1.8.6.  I ran PDFBox's app.jar with ExtractText:
> {noformat}
> 764949.pdf
> 1.8.6: Lang, Astrophysical Data: Planets and Stars
> 1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
> {noformat}
> and
> {noformat}
> 312888.pdf
> 1.8.6: Self-Assessment & Capability Description
> 1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
