[
https://issues.apache.org/jira/browse/PDFBOX-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146689#comment-14146689
]
John Hewson edited comment on PDFBOX-2377 at 9/24/14 6:33 PM:
--------------------------------------------------------------
Looking at the commits from PDFBOX-2247, the use of FirstChar is not correct.
FirstChar should not be involved in encoding; it should be used only for
retrieving glyph widths from the Widths array. The charOffset variable should
be removed, and the "code" variable in getCharacter() should be left unmodified.
It looks like PDType1CFont#getFontWidth() isn't using the Widths array at all,
which might be the cause of the original problem. Alternatively, there may be
deeper encoding issues with 1.8. None of this applies to the trunk anymore.
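To illustrate the distinction, here is a minimal sketch of the two lookups,
assuming the usual PDF simple-font layout in which Widths holds one entry per
code from FirstChar to LastChar. This is not the PDFBox code: the class and the
methods widthForCode() and glyphNameForCode() are hypothetical names, and the
plain String[] encoding table stands in for whatever the font actually uses.

```java
// Hypothetical sketch, not PDFBox code: FirstChar offsets the Widths index only.
class WidthLookupSketch {

    /**
     * Width lookup: FirstChar is subtracted from the code to index Widths.
     * Codes outside the array fall back to missingWidth (assumed to come
     * from the font descriptor's MissingWidth entry).
     */
    static float widthForCode(int code, int firstChar, float[] widths, float missingWidth) {
        int index = code - firstChar;
        if (index >= 0 && index < widths.length) {
            return widths[index];
        }
        return missingWidth;
    }

    /**
     * Encoding lookup: the code is used unchanged. Shifting it by FirstChar
     * here would map every character to the wrong glyph name.
     */
    static String glyphNameForCode(int code, String[] encoding) {
        if (code >= 0 && code < encoding.length && encoding[code] != null) {
            return encoding[code];
        }
        return ".notdef";
    }

    public static void main(String[] args) {
        int firstChar = 32;
        float[] widths = {250f, 333f, 408f};   // widths for codes 32, 33, 34
        String[] encoding = new String[256];   // illustrative encoding table
        encoding[32] = "space";
        encoding[33] = "exclam";
        encoding[34] = "quotedbl";

        int code = 33;
        System.out.println(glyphNameForCode(code, encoding));
        System.out.println(widthForCode(code, firstChar, widths, 0f));
    }
}
```

With this split, a wrong FirstChar can only produce wrong widths; it cannot
change which characters are extracted.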
> Apparent regression in character mapping in a few files from govdocs1
> ---------------------------------------------------------------------
>
> Key: PDFBOX-2377
> URL: https://issues.apache.org/jira/browse/PDFBOX-2377
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.7
> Reporter: Tim Allison
> Priority: Minor
> Labels: regression
> Attachments: 312888.pdf, 764929.pdf
>
>
> On a small number of test files in a 50k sample of pdfs from govdocs1, it
> appears that some characters are no longer being extracted correctly in 1.8.7
> when compared to 1.8.6. I ran pdfbox's app.jar with ExtractText
> {noformat}
> 764949.pdf
> 1.8.6: Lang, Astrophysical Data: Planets and Stars
> 1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
> {noformat}
> and
> {noformat}
> 312888.pdf
> 1.8.6: Self-Assessment & Capability Description
> 1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
> {noformat}