[ https://issues.apache.org/jira/browse/PDFBOX-4785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054824#comment-17054824 ]
Tilman Hausherr commented on PDFBOX-4785:
-----------------------------------------
If we allow incorrect ranges we'll have other files that will get incorrect
text extractions or other problems. The version history of CMapParser.java
shows that we've been struggling with this since last May. I just tried
changing the code at the "PDFBOX-4661" comment to "int values = end - start"
and had 8 tests (!) that failed, despite disabling the specific test in fontbox.
To experience this yourself, try changing the parser until PDFBox builds
again. Once that is done, please post the change here. I'll then test it on
additional files that are not in the source download due to licensing.
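
For readers who have not looked at bfrange handling: the PDF specification
says that when a bfrange entry has a single destination string, the last byte
of that string is incremented once per source code and must not go past 255.
The sketch below is a standalone illustration of that rule, not the actual
CMapParser code; the class and method names (BfRangeSketch, cappedCount,
expand) are made up for this example, and UTF-16BE decoding of the destination
is assumed.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not taken from FontBox.
public class BfRangeSketch {

    // Number of source codes that can be mapped before the last destination
    // byte would be incremented past 0xFF. The range start..end is inclusive,
    // hence the "+ 1".
    static int cappedCount(int start, int end, byte[] dst) {
        int lastByte = dst[dst.length - 1] & 0xFF;
        return Math.min(end - start, 255 - lastByte) + 1;
    }

    // Expands one bfrange entry <start> <end> <dst> into code -> text pairs,
    // incrementing only the last destination byte for each consecutive code.
    static Map<Integer, String> expand(int start, int end, byte[] dst) {
        Map<Integer, String> result = new LinkedHashMap<>();
        int count = cappedCount(start, end, dst);
        byte[] current = dst.clone();
        for (int i = 0; i < count; i++) {
            result.put(start + i, new String(current, StandardCharsets.UTF_16BE));
            current[current.length - 1]++;
        }
        return result;
    }

    public static void main(String[] args) {
        // A well-formed range: three codes mapped to consecutive characters.
        System.out.println(expand(0x10, 0x12, new byte[] {0x00, 0x41}));
        // A range whose last byte would overflow: only part of it is mapped.
        System.out.println(expand(0x00, 0x05, new byte[] {0x30, (byte) 0xFE}).size());
    }
}
```

In this sketch, computing the count as plain "end - start" would drop the last
code of every well-formed range and would also remove the overflow cap, which
is the kind of side effect that shows up as failures in unrelated tests.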
> No Unicode mapping with MS-Mincho
> ---------------------------------
>
> Key: PDFBOX-4785
> URL: https://issues.apache.org/jira/browse/PDFBOX-4785
> Project: PDFBox
> Issue Type: Bug
> Components: FontBox
> Affects Versions: 2.0.18, 2.0.19
> Reporter: Ryosuke Fujita
> Priority: Major
> Attachments: E02779_convocation_notice_p14.pdf
>
>
> ExtractText on the attached PDF fails as of v2.0.18, while v2.0.17 succeeds.
> The error message is as follows, and the character "最" (CID+7025) can't be extracted.
> FEB 26, 2020 10:32:29 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+7025 (7025) in font NAEGKL+MS-Mincho
> This may be related to PDFBOX-4661?