Michael Tighe created PDFBOX-5406:
-------------------------------------

             Summary: Assumption of Identity Not Valid for Text Extraction
                 Key: PDFBOX-5406
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5406
             Project: PDFBox
          Issue Type: Bug
    Affects Versions: 2.0.24
            Reporter: Michael Tighe


PDFBox issue 1090 (closed years ago) makes an assumption that can cause the 
text extraction process to return garbage.

Version: PDFBOX v2.0.24

PDFBOX -> PDFont.java -> loadUnicodeCMap line 150

The code distinctly KNOWS that there is no UNICODE map.

It then makes a number of guesses, runs out of options, and explicitly makes 
an assumption that silently creates bad output:

{{    LOG.warn("Invalid ToUnicode CMap in font " + getName());}}

{{    ...}}

{{    LOG.warn("Using predefined identity CMap instead");}}

Every document I have seen that triggers this warning yields bad text when 
PDFBOX is used for text extraction.

My reasoning is that the producer of that PDF ignored the CMap, and assuming 
that an identity mapping can be used in reverse causes PDFBOX to fail 
silently.  The software package calling PDFBOX gets no indication that there 
is an issue.
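
To illustrate the failure mode, here is a toy model (not PDFBox code). It 
assumes a hypothetical subset font whose producer numbered its glyphs 1..4 
and supplied no usable ToUnicode CMap. The class name, the INTENDED table and 
the sample codes are all made up for the example.

```java
import java.util.Map;
import java.util.stream.Collectors;

/** Toy model only, not PDFBox code: why an identity fallback can yield wrong text. */
final class IdentityFallbackDemo {

    // Hypothetical subset font: the producer assigned codes 1..4 to the glyphs it used.
    // This table stands in for the mapping that the broken PDF fails to provide.
    static final Map<Integer, String> INTENDED = Map.of(1, "H", 2, "e", 3, "l", 4, "o");

    /** What the reader of the page sees: each code rendered as its intended glyph. */
    static String intended(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(INTENDED.get(code));
        }
        return sb.toString();
    }

    /** Identity assumption: treat each character code as if it were a Unicode code point. */
    static String identity(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.appendCodePoint(code);
        }
        return sb.toString();
    }

    /** Renders a string as U+XXXX values so control characters are visible. */
    static String escaped(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        int[] codes = {1, 2, 3, 3, 4};
        System.out.println(intended(codes));          // Hello
        System.out.println(escaped(identity(codes))); // U+0001 U+0002 U+0003 U+0003 U+0004
    }
}
```

In this model the identity fallback returns five control characters for a 
word that displays as "Hello", and nothing in the return value tells the 
caller that the text is unreliable.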

I propose that this code throw an exception rather than log a warning.

That way the extraction caller KNOWS that the text is wrong.

I have examples identical to those shown in the original issue.

Is there any more recent work on this issue?  E.g., parameters that could be 
set to say "I want perfect extraction or no extraction"? 
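
A minimal sketch of what such a parameter could look like is below. It is a 
standalone model, not a patch against PDFont.java: the Policy enum, the 
UnmappableTextException type and the decode method are hypothetical names 
that do not exist in PDFBox, and a null map stands in for a missing or 
invalid ToUnicode CMap.

```java
import java.util.Map;

/** Standalone sketch of a "perfect extraction or no extraction" switch. Not PDFBox API. */
final class StrictToUnicodeSketch {

    /** Hypothetical policy flag chosen by the extraction caller. */
    enum Policy { LENIENT, STRICT }

    /** Hypothetical exception type; PDFBox does not define it. */
    static final class UnmappableTextException extends RuntimeException {
        UnmappableTextException(String message) {
            super(message);
        }
    }

    /**
     * Decodes character codes to text.
     *
     * @param toUnicode code-to-text map, or null to model a missing or invalid ToUnicode CMap
     */
    static String decode(int[] codes, Map<Integer, String> toUnicode,
                         String fontName, Policy policy) {
        if (toUnicode == null && policy == Policy.STRICT) {
            // Strict mode: refuse to guess, so the caller knows the text cannot be trusted.
            throw new UnmappableTextException("Invalid ToUnicode CMap in font " + fontName);
        }
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            if (toUnicode == null) {
                // Lenient mode: identity fallback, the behaviour this report objects to.
                sb.appendCodePoint(code);
            } else {
                sb.append(toUnicode.get(code));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = {72, 105};
        System.out.println(decode(codes, Map.of(72, "H", 105, "i"), "DemoFont", Policy.STRICT));
        try {
            decode(codes, null, "DemoFont", Policy.STRICT);
        } catch (UnmappableTextException e) {
            System.out.println("strict mode rejected the font: " + e.getMessage());
        }
    }
}
```

With a switch of this kind, existing callers could keep the lenient default, 
while callers that need trustworthy text would opt in to the exception.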



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org