[ 
https://issues.apache.org/jira/browse/PDFBOX-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-5406.
-----------------------------------
    Resolution: Not A Bug

> Assumption of Identity Not Valid for Text Extraction
> ----------------------------------------------------
>
>                 Key: PDFBOX-5406
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5406
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.24
>            Reporter: Michael Tighe
>            Priority: Major
>
> PDF BOX issue 1090 (closed years ago) makes an assumption that can lead to 
> serious issues when the text extraction process returns garbage.
> Version: PDFBOX v2.0.24
> PDFBOX -> PDFont.java -> loadUnicodeCMap line 150
> The code distinctly KNOWS that there is no UNICODE map.
> It then makes a number of guesses - runs out of options, and explicitly makes 
> an assumption that silently creates bad output.
>     LOG.warn("Invalid ToUnicode CMap in font " + getName());
>     ...
>     LOG.warn("Using predefined identity CMap instead");
> Every document that I've seen that produces that WARNING has bad text 
> returned for the document when you use PDFBOX to do text extraction.
> My logic is that the CMap is being ignored by the producer of that PDF, and 
> assuming that it's possible to use the reverse causes silent failure on the 
> part of PDFBOX.  The software package calling PDFBOX gets no warning that 
> there is an issue.
> I propose that this code throw an exception rather than a warning.
> That way the extraction caller KNOWS that the text is wrong.
> I have examples identical to those shown in the original issue.
> Is there any more recent work on this issue?  E.g., parameters that could be 
> set to say "I want perfect extraction or no extraction"? 
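
The proposal in the report (fail loudly instead of silently falling back to an
identity mapping) can be illustrated with a minimal Java sketch. Nothing below
is PDFBox code or PDFBox API: the class StrictExtractionSketch, the Mode enum,
UnmappedTextException and the decode method are hypothetical names chosen for
illustration, and a null map stands in for a missing or invalid ToUnicode CMap.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not PDFBox code: models the "strict or lenient" choice
// the reporter asks for when no usable ToUnicode CMap exists.
class StrictExtractionSketch {

    /** Hypothetical exception type; PDFBox does not define it. */
    static class UnmappedTextException extends RuntimeException {
        UnmappedTextException(String message) {
            super(message);
        }
    }

    /** LENIENT models the identity fallback, STRICT models the proposed failure. */
    enum Mode { LENIENT, STRICT }

    /**
     * Maps a character code to text.
     * A null toUnicode map is this sketch's stand-in for an invalid or missing CMap.
     */
    static String decode(int code, Map<Integer, String> toUnicode, String fontName, Mode mode) {
        if (toUnicode != null) {
            // A usable map exists: look the code up (may return null if the code is absent).
            return toUnicode.get(code);
        }
        if (mode == Mode.STRICT) {
            // Proposed behavior: tell the caller instead of guessing.
            throw new UnmappedTextException("No usable ToUnicode CMap in font " + fontName);
        }
        // Identity assumption: treat the character code itself as the Unicode value.
        return String.valueOf((char) code);
    }

    public static void main(String[] args) {
        Map<Integer, String> cmap = new HashMap<>();
        cmap.put(0x24, "A"); // the font maps code 0x24 to the letter A

        System.out.println(decode(0x24, cmap, "F1", Mode.STRICT));  // prints A
        System.out.println(decode(0x24, null, "F1", Mode.LENIENT)); // prints $ (identity guess)
        try {
            decode(0x24, null, "F1", Mode.STRICT);
        } catch (UnmappedTextException e) {
            System.out.println("refused: " + e.getMessage());
        }
    }
}
```

In the lenient branch the caller receives "$" where the font meant "A", with no
signal that the text is a guess; the strict branch is the "perfect extraction
or no extraction" behavior the report asks about. Whether such a switch belongs
in PDFBox is the question the issue raises, and it was closed as Not A Bug.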



