[ https://issues.apache.org/jira/browse/PDFBOX-4612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16894384#comment-16894384 ]
Tilman Hausherr commented on PDFBOX-4612:
-----------------------------------------
This is caused by an incorrect glyph name (C24) in the font (page 7, font F5).
Adobe Reader cannot extract it properly either; it also produces "ataxia, and
death by (SOH)4 months". (The attached file is our extraction.) See also the
FAQ: https://pdfbox.apache.org/2.0/faq.html#text-extraction
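
To illustrate the symptom described above, here is a minimal sketch in plain Java for flagging stray control characters such as SOH (U+0001) in text that has already been extracted. It is not part of PDFBox; the class and method names (ControlCharCheck, hasControlChars, markControlChars) are hypothetical, and the choice to treat tab, line feed, and carriage return as legitimate is an assumption.

```java
// Hypothetical helper, not a PDFBox API: detects and marks C0 control
// characters that typically indicate a glyph could not be mapped to Unicode.
final class ControlCharCheck {

    // Assumption: tab, LF, and CR are legitimate in extracted text;
    // every other character below U+0020 is treated as suspicious.
    static boolean isSuspicious(char c) {
        return c < 0x20 && c != '\t' && c != '\n' && c != '\r';
    }

    // Returns true if the text contains at least one suspicious control character.
    static boolean hasControlChars(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (isSuspicious(text.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // Replaces each suspicious control character with a visible marker
    // of the form <U+XXXX> so that it shows up in logs or diffs.
    static String markControlChars(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isSuspicious(c)) {
                sb.append(String.format("<U+%04X>", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

A check like this only reports that a character was mapped badly. It cannot recover the intended character ("~" in this report), because that information is missing from the font in the PDF.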
> The ExtractText command extracts wrong text
> -------------------------------------------
>
> Key: PDFBOX-4612
> URL: https://issues.apache.org/jira/browse/PDFBOX-4612
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.16
> Reporter: Yuri
> Priority: Major
> Attachments: bartel2018-p7.txt
>
>
> In this pdf [http://sci-hub.tw/10.1016/j.cell.2018.03.006] it extracts the
> text "ataxia, and death by ~4 months" as "ataxia, and death by ^A4 months".