[ https://issues.apache.org/jira/browse/PDFBOX-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515961#comment-17515961 ]
Tilman Hausherr commented on PDFBOX-5406:
-----------------------------------------
Yes, sometimes we get trash. But there are also cases where Adobe Reader
produces trash. Some files have a /ToUnicode map and still return trash.
We don't have a "strict" setting because there's no simple solution. A
workaround is to use a word dictionary to detect whether the output is trash
and, if it is, run OCR.
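A rough sketch of the dictionary check, in case it helps. Nothing here is part
of PDFBox: the class name ExtractionSanityCheck, the method names and the
minRatio threshold are all made up for illustration, and the word list is
assumed to be lowercase and supplied by the caller.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not a PDFBox class): scores extracted text against a word list.
final class ExtractionSanityCheck {

    // Returns the fraction of alphabetic tokens (2+ letters) that appear in the
    // dictionary. Returns 0.0 when the text contains no such tokens at all.
    // The dictionary is assumed to hold lowercase words.
    static double knownWordRatio(String text, Set<String> dictionary) {
        int total = 0;
        int known = 0;
        // Split on anything that is not a Unicode letter.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.length() < 2) {
                continue; // skip empty fragments and single letters
            }
            total++;
            if (dictionary.contains(token)) {
                known++;
            }
        }
        return total == 0 ? 0.0 : (double) known / total;
    }

    // Flags the text as probable trash when too few tokens are known words.
    // minRatio is an assumed tuning knob; a sensible value depends on the
    // language and on how complete the dictionary is.
    static boolean looksLikeTrash(String text, Set<String> dictionary, double minRatio) {
        return knownWordRatio(text, dictionary) < minRatio;
    }
}
```

The caller would pass in the string returned by PDFTextStripper.getText() and
hand the document to an OCR tool when looksLikeTrash() returns true. Pages that
contain mostly numbers, names or another language will score low even when the
extraction is fine, so the threshold needs testing against real documents.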
> Assumption of Identity Not Valid for Text Extraction
> ----------------------------------------------------
>
> Key: PDFBOX-5406
> URL: https://issues.apache.org/jira/browse/PDFBOX-5406
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.24
> Reporter: Michael Tighe
> Priority: Major
>
> PDF BOX issue 1090 (closed years ago) makes an assumption that can lead to
> serious issues when the text extraction process returns garbage.
> Version: PDFBOX v2.0.24
> PDFBOX -> PDFont.java -> loadUnicodeCMap line 150
> The code distinctly KNOWS that there is no UNICODE map.
> It then makes a number of guesses - runs out of options, and explicitly makes
> an assumption that silently creates bad output.
> {{ LOG.warn("Invalid ToUnicode CMap in font " + getName());}}
> {{ ...}}
> {{ LOG.warn("Using predefined identity CMap instead");}}
> Every document that I've seen that produces that WARNING has bad text
> returned for the document when you use PDFBOX to do text extraction.
> My logic is that the CMap is being ignored by the producer of that PDF, and
> assuming that it's possible to use the reverse causes silent failure on the
> part of PDFBOX. The software package calling PDFBOX gets no warning that
> there is an issue.
> I propose that this code throw an exception rather than a warning.
> That way the extraction caller KNOWS that the text is wrong.
> I have examples identical to those shown in the original issue.
> Is there any more recent work on this issue? E.g., parameters that could be
> set to say "I want perfect extraction or no extraction"?