[jira] [Commented] (PDFBOX-4737) Text extraction is gibberish

Michael Klink (Jira) Wed, 15 Jan 2020 04:13:13 -0800


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015900#comment-17015900
 ]


Michael Klink commented on PDFBOX-4737:
---------------------------------------

[~tilman]
 There is one argument in favor of introducing a "strict" switch: as long as 
PDFBox attempts to extract text from content for which, strictly speaking, the 
required information is missing, one cannot turn to the PDF producer and 
complain about gibberish from text extraction, because the producer can object 
that PDFBox created the gibberish out of thin air. But if one can counter-check 
by extracting in strict mode and still gets gibberish, one can tell the 
producer: no, the gibberish is the text the PDF explicitly offers.

Depending on one's legal relationship with the PDF producer, this may be 
necessary to make a legitimate claim for repaired documents.

On the other hand, of course, a proper implementation of a strict mode would 
require quite a lot of work, and a half-hearted implementation would be 
worthless.
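
A toy sketch of the counter-check described above, in plain Java. It does not 
use PDFBox, and the strict switch is only a proposal in this issue; 
StrictLaxSketch, Glyph, explicitText and extract are hypothetical names, and 
the lax fallback (reusing the raw code as a character) is merely a stand-in 
for whatever guessing a real extractor does.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these types exist in PDFBox.
final class StrictLaxSketch {

    /** One character code plus the text the PDF explicitly assigns to it. */
    static final class Glyph {
        final int code;             // raw character code in the font
        final String explicitText;  // explicit mapping (e.g. ToUnicode), or null if absent

        Glyph(int code, String explicitText) {
            this.code = code;
            this.explicitText = explicitText;
        }
    }

    /**
     * Strict mode emits only text the PDF explicitly offers and marks the rest
     * with U+FFFD. Lax mode guesses where the mapping is missing (assumption:
     * the guess is simply the raw code interpreted as a character).
     */
    static String extract(List<Glyph> glyphs, boolean strict) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            if (g.explicitText != null) {
                sb.append(g.explicitText);   // explicitly offered by the PDF
            } else if (strict) {
                sb.append('\uFFFD');         // no guessing in strict mode
            } else {
                sb.append((char) g.code);    // lax guess, possibly gibberish
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Case 1: a mapping is missing, so the two modes disagree and any
        // gibberish in the lax output may be the extractor's own invention.
        List<Glyph> missing = Arrays.asList(
                new Glyph(0x48, "H"), new Glyph(0x03, null), new Glyph(0x69, "i"));
        System.out.println(extract(missing, true).equals(extract(missing, false))); // false

        // Case 2: every code has an explicit (but wrong) mapping, so strict
        // mode yields the same gibberish: it is what the PDF itself offers.
        List<Glyph> obfuscated = Arrays.asList(
                new Glyph(0x48, "Q"), new Glyph(0x69, "x"));
        System.out.println(extract(obfuscated, true).equals(extract(obfuscated, false))); // true
    }
}
```

In the second case the strict output is still gibberish, which is the 
situation where one could go back to the producer.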

> Text extraction is gibberish
> ----------------------------
>
>                 Key: PDFBOX-4737
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4737
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.18
>            Reporter: Jorge Spinsanti
>            Priority: Major
>         Attachments: noUnicodeMapping.pdf, obfuscateTest_Duplicate_2_3.pdf
>
>
> As discussed in https://issues.apache.org/jira/browse/PDFBOX-4549, 
> there are many PDFs for which text extraction yields gibberish.
> Perhaps you could add two modes (strict/lax) to text extraction, so that 
> gibberish can be avoided when it is not useful. I have attached a file for 
> analyzing the problem.
> [^noUnicodeMapping.pdf]



