[jira] [Commented] (PDFBOX-5406) Assumption of Identity Not Valid for Text Extraction

Michael Tighe (Jira) Fri, 01 Apr 2022 08:19:05 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515962#comment-17515962
 ]


Michael Tighe commented on PDFBOX-5406:
---------------------------------------

Thanks for your reply.  Your insight is valuable:  "Some files have a 
/ToUnicode map and still return trash".

I had already thought about the "use a word dictionary" solution and will use 
your input to push it through my product pipeline.

Thanks!

-- Michael


On 4/1/22, 11:14 AM, "Tilman Hausherr (Jira)" <[email protected]> wrote:


        [ 
https://issues.apache.org/jira/browse/PDFBOX-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515961#comment-17515961
  ] 

    Tilman Hausherr commented on PDFBOX-5406:
    -----------------------------------------

    Yes, sometimes we get trash. But there are also cases where Adobe Reader 
produces trash. Some files have a /ToUnicode map and still return trash.

    We don't have a "strict" setting because there's no simple solution. Use a 
word dictionary to detect whether the output is trash, and then run OCR.
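
The dictionary check suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: TrashDetector, knownWordRatio, looksLikeTrash and the threshold parameter are hypothetical names, none of them is part of PDFBox, and the word list and threshold would need tuning per language and document type.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not a PDFBox class): flags text in which few tokens are known words. */
final class TrashDetector {
    private final Set<String> dictionary = new HashSet<>();
    private final double minKnownRatio;

    /**
     * @param words         known words, e.g. loaded from a word list (assumption)
     * @param minKnownRatio minimum fraction of tokens that must be known words
     */
    TrashDetector(Set<String> words, double minKnownRatio) {
        for (String w : words) {
            dictionary.add(w.toLowerCase(Locale.ROOT));
        }
        this.minKnownRatio = minKnownRatio;
    }

    /** Fraction of alphabetic tokens found in the dictionary; 0.0 if there are no tokens. */
    double knownWordRatio(String text) {
        int total = 0;
        int known = 0;
        // Split on every run of characters that are not Unicode letters.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token when the text starts with a separator
            }
            total++;
            if (dictionary.contains(token)) {
                known++;
            }
        }
        return total == 0 ? 0.0 : (double) known / total;
    }

    /** True when too few tokens are dictionary words; empty text is also flagged. */
    boolean looksLikeTrash(String text) {
        return knownWordRatio(text) < minKnownRatio;
    }
}
```

A caller would pass the extracted text to looksLikeTrash and send the document to OCR when it returns true. Short texts, names, and numbers-only pages will score low even when extraction is fine, so the result is a hint rather than proof.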

    > Assumption of Identity Not Valid for Text Extraction
    > ----------------------------------------------------
    >
    >                 Key: PDFBOX-5406
    >                 URL: https://issues.apache.org/jira/browse/PDFBOX-5406 
    >             Project: PDFBox
    >          Issue Type: Bug
    >    Affects Versions: 2.0.24
    >            Reporter: Michael Tighe
    >            Priority: Major
    >
    > PDF BOX issue 1090 (closed years ago) makes an assumption that can lead 
to serious issues when the text extraction process returns garbage.
    > Version: PDFBOX v2.0.24
    > PDFBOX -> PDFont.java -> loadUnicodeCMap line 150
    > The code distinctly KNOWS that there is no UNICODE map.
    > It then makes a number of guesses, runs out of options, and explicitly 
makes an assumption that silently creates bad output.
    >     LOG.warn("Invalid ToUnicode CMap in font " + getName());
    >     ...
    >     LOG.warn("Using predefined identity CMap instead");
    > Every document that I've seen that produces that WARNING has bad text 
returned for the document when you use PDFBOX to do text extraction.
    > My logic is that the CMap is being ignored by the producer of that PDF, 
and assuming that it's possible to use the reverse causes silent failure on the 
part of PDFBOX.  The software package calling PDFBOX gets no warning that there 
is an issue.
    > I propose that this code throw an exception rather than a warning.
    > That way the extraction caller KNOWS that the text is wrong.
    > I have examples identical to those shown in the original issue.
    > Is there any more recent work on this issue?  E.g., parameters that could 
be set to say "I want perfect extraction or no extraction"? 
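
Since the reply above states that PDFBox has no "strict" setting, a "perfect extraction or no extraction" policy would have to live in the calling code. The following is a minimal sketch of that idea, not something PDFBox provides: StrictExtraction, requirePlausible and UnreliableTextException are hypothetical names, and the plausibility check itself is supplied by the caller.

```java
import java.util.function.Predicate;

/** Hypothetical caller-side exception; PDFBox does not define it. */
class UnreliableTextException extends RuntimeException {
    UnreliableTextException(String message) {
        super(message);
    }
}

/** Hypothetical wrapper that fails loudly instead of passing on suspicious text. */
final class StrictExtraction {
    private StrictExtraction() {
    }

    /**
     * Returns the text unchanged if the caller-supplied check accepts it,
     * otherwise throws so that bad text cannot pass silently.
     *
     * @param extractedText  text already obtained from the extraction step
     * @param looksPlausible caller-defined check, e.g. a dictionary heuristic (assumption)
     */
    static String requirePlausible(String extractedText, Predicate<String> looksPlausible) {
        if (extractedText == null || !looksPlausible.test(extractedText)) {
            throw new UnreliableTextException("Extracted text failed the plausibility check");
        }
        return extractedText;
    }
}
```

The caller can catch UnreliableTextException and fall back to OCR or reject the document. This only moves the decision to the caller; it cannot guarantee that accepted text is correct.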



    --
    This message was sent by Atlassian Jira
    (v8.20.1#820001)



