[
https://issues.apache.org/jira/browse/PDFBOX-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015900#comment-17015900
]
Michael Klink commented on PDFBOX-4737:
---------------------------------------
[~tilman]
There is one argument in favor of introducing a "strict" switch: As long as
PDFBox attempts to extract text from content for which, strictly speaking, the
required information is missing, one cannot turn to the PDF producer and
complain about gibberish from text extraction, because the producer can object
that the gibberish is created by PDFBox out of thin air. But if one can
counter-check by extracting in strict mode and still gets gibberish, one
can tell the producer that no, the gibberish is the text the PDF explicitly offers.
Depending on one's legal relationship with the PDF producer, this may be necessary to
make a claim for repaired documents legitimate.
On the other hand, of course, a proper implementation of a strict mode will
require quite a lot of work, and a half-hearted implementation is worthless.
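
To make the counter-check idea concrete, here is a minimal, self-contained
sketch of the difference between the two modes. It does not use PDFBox and is
not how PDFBox is implemented; the class StrictExtractionSketch, the method
extract, and the map standing in for the Unicode information a PDF explicitly
provides are all hypothetical names chosen for illustration. The lax branch
guesses by treating the character code as a character, which is just one
conceivable kind of fallback.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in PDFBox.
class StrictExtractionSketch {

    // explicitMapping stands in for the Unicode information the PDF itself
    // offers for a font: character code -> Unicode text.
    static String extract(int[] codes, Map<Integer, String> explicitMapping, boolean strict) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            String unicode = explicitMapping.get(code);
            if (unicode != null) {
                // The document explicitly says what this code means.
                out.append(unicode);
            } else if (strict) {
                // Strict: never guess; mark the gap with U+FFFD.
                out.append('\uFFFD');
            } else {
                // Lax: guess by reusing the code as a character (assumed fallback).
                out.append((char) code);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> mapping = new HashMap<>();
        mapping.put(1, "H");
        mapping.put(2, "i");
        int[] codes = {1, 0x23, 2}; // 0x23 has no explicit mapping

        System.out.println("lax:    " + extract(codes, mapping, false));
        System.out.println("strict: " + extract(codes, mapping, true));
    }
}
```

In this sketch, anything the strict call returns other than the replacement
character comes from the explicit mapping alone, which is the property the
argument above relies on: gibberish that survives strict mode cannot be
blamed on the extractor's guesses.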
> Text extraction is gibberish
> ----------------------------
>
> Key: PDFBOX-4737
> URL: https://issues.apache.org/jira/browse/PDFBOX-4737
> Project: PDFBox
> Issue Type: Improvement
> Affects Versions: 2.0.18
> Reporter: Jorge Spinsanti
> Priority: Major
> Attachments: noUnicodeMapping.pdf, obfuscateTest_Duplicate_2_3.pdf
>
>
> As discussed in https://issues.apache.org/jira/browse/PDFBOX-4549,
> there are many PDFs for which text extraction yields gibberish.
> Perhaps you could add two modes (strict/lax) to text extraction, so that
> gibberish can be avoided when it is not useful. I have attached a file to
> help analyze the problem.
> [^noUnicodeMapping.pdf]