[ https://issues.apache.org/jira/browse/PDFBOX-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr closed PDFBOX-5406.
-----------------------------------
    Resolution: Not A Bug

> Assumption of Identity Not Valid for Text Extraction
> ----------------------------------------------------
>
>                 Key: PDFBOX-5406
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5406
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.24
>            Reporter: Michael Tighe
>            Priority: Major
>
> PDFBox issue 1090 (closed years ago) makes an assumption that can lead to
> serious issues when the text extraction process returns garbage.
>
> Version: PDFBox v2.0.24
> PDFBox -> PDFont.java -> loadUnicodeCMap, line 150
>
> The code distinctly KNOWS that there is no Unicode map.
> It then makes a number of guesses, runs out of options, and explicitly makes
> an assumption that silently creates bad output.
> {{LOG.warn("Invalid ToUnicode CMap in font " + getName());}}
> {{...}}
> {{LOG.warn("Using predefined identity CMap instead");}}
>
> Every document that I've seen that produces that WARNING has bad text
> returned for the document when you use PDFBox to do text extraction.
>
> My logic is that the CMap is being ignored by the producer of that PDF, and
> assuming that it's possible to use the reverse causes silent failure on the
> part of PDFBox. The software package calling PDFBox gets no warning that
> there is an issue.
>
> I propose that this code throw an exception rather than log a warning.
> That way the extraction caller KNOWS that the text is wrong.
>
> I have examples identical to those shown in the original issue.
>
> Is there any more recent work on this issue? E.g., parameters that could be
> set to say "I want perfect extraction or no extraction"?

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
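
For callers who want the "perfect extraction or no extraction" behaviour the report asks about, one possible caller-side workaround is to inspect the extracted string and reject it when it looks misdecoded. The sketch below is not part of PDFBox and calls no PDFBox API: the class `ExtractionCheck`, the exception `SuspectTextException`, and the threshold parameter are all hypothetical names introduced for illustration. It assumes that text produced through a wrong identity mapping tends to contain control, private-use, or unassigned characters, which will not be true for every document.

```java
// Hypothetical helper, not part of PDFBox: rejects extracted text that looks misdecoded.
class ExtractionCheck {

    /** Hypothetical exception type used to signal untrustworthy text to the caller. */
    public static class SuspectTextException extends RuntimeException {
        public SuspectTextException(String message) {
            super(message);
        }
    }

    /**
     * Returns the fraction of code points that are U+FFFD, private-use,
     * unassigned, lone surrogates, or non-whitespace control characters.
     */
    public static double suspectRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspect = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            int type = Character.getType(cp);
            boolean bad = cp == 0xFFFD
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED
                    || type == Character.SURROGATE
                    || (type == Character.CONTROL && !Character.isWhitespace(cp));
            if (bad) {
                suspect++;
            }
        }
        return (double) suspect / total;
    }

    /** Returns the text unchanged, or throws if too many characters look suspect. */
    public static String requireClean(String text, double maxRatio) {
        double ratio = suspectRatio(text);
        if (ratio > maxRatio) {
            throw new SuspectTextException(
                    "Extracted text looks misdecoded: suspect ratio " + ratio);
        }
        return text;
    }
}
```

A caller would pass the string returned by its text extraction step through `requireClean` with a threshold of its own choosing. This is a heuristic only: a wrong mapping that happens to yield ordinary letters passes the check, so it cannot replace an explicit failure signal from the library.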