[
https://issues.apache.org/jira/browse/PDFBOX-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150565#comment-14150565
]
Andreas Lehmkühler commented on PDFBOX-2377:
--------------------------------------------
{quote}
I don't get why characters < 32 would be handled differently?
{quote}
All codes >= 32 and <= 127 can be mapped using ASCII. Everything < 32 can't be
mapped that easily, e.g. by using ISO-8859-1 or UTF-8, so we have to use the
internal font mapping for those codes.
But this doesn't work in every case:
- 357094.pdf some of the "." are represented as ":" within the pdf, so that
those will be mapped incorrectly using ASCII. The font mapping would provide a
correct value. Adobe provides the same wrong result.
- 312888.pdf works fine with the current implementation. It works too if we use
the internal font mapping for all characters
- 701542.pdf from PDFBOX-2247 works fine with the current implementation, but
provides rubbish if we use the internal font mapping for all characters
It looks as if we have to implement a compromise, as it seems that neither of
the possible implementations will provide a 100% solution.
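To make the current behaviour described above concrete, here is a minimal sketch of the split between ASCII and the font mapping. It is not the actual PDFBox code: the class name CodeMappingSketch and the plain Map used as a stand-in for the font's internal mapping are hypothetical, and sending codes > 127 to the font mapping is an assumption, since only the ranges < 32 and 32..127 are discussed here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of PDFBox.
class CodeMappingSketch {
    // Stand-in for the font's internal code-to-text mapping (assumption).
    private final Map<Integer, String> fontMapping;

    CodeMappingSketch(Map<Integer, String> fontMapping) {
        this.fontMapping = new HashMap<>(fontMapping);
    }

    /** Returns the text for a character code, or null if nothing is known. */
    String map(int code) {
        if (code >= 32 && code <= 127) {
            // Range mentioned in the comment: interpret the code as ASCII.
            return String.valueOf((char) code);
        }
        // Codes < 32 (and, by assumption, > 127): ask the font mapping.
        return fontMapping.get(code);
    }
}
```

With this split, a file like 357094.pdf, where the code for ":" is meant to display as ".", still yields ":" because the font mapping is never consulted for that range. Using the font mapping for every code would fix that case but break 701542.pdf, which is why some compromise seems to be needed.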
> Apparent regression in character mapping in a few files from govdocs1
> ---------------------------------------------------------------------
>
> Key: PDFBOX-2377
> URL: https://issues.apache.org/jira/browse/PDFBOX-2377
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.7
> Reporter: Tim Allison
> Assignee: Andreas Lehmkühler
> Priority: Minor
> Labels: regression
> Attachments: 290991-6.txt, 290991-7.txt, 290991-8.txt, 290991.pdf,
> 312888.pdf, 357094-1.8.6.txt, 357094-1.8.8.txt, 357094.pdf, 764929.pdf,
> PDFBOX2247-701542.pdf
>
>
> On a small number of test files in a 50k sample of pdfs from govdocs1, it
> appears that some characters are no longer being extracted correctly in 1.8.7
> when compared to 1.8.6. I ran pdfbox's app.jar with ExtractText:
> {noformat}
> 764949.pdf
> 1.8.6: Lang, Astrophysical Data: Planets and Stars
> 1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
> {noformat}
> and
> {noformat}
> 312888.pdf
> 1.8.6: Self-Assessment & Capability Description
> 1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)