[ https://issues.apache.org/jira/browse/PDFBOX-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012820#comment-17012820 ]
Michael Klink commented on PDFBOX-4737:
---------------------------------------
A strict/lax mode could help prevent PDFBox from attempting text extraction
when the text extraction information is broken. Broken text extraction
information, however, usually is not what obfuscators create; it is what buggy
PDF generators create.
Obfuscators usually generate PDFs either without text extraction information
(like your examples) or with misleading information (as in [this Stack
Overflow Q&A|https://stackoverflow.com/a/22688775/1729265]).
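To illustrate what a strict mode could and could not catch, here is a minimal
sketch in plain Java. It is not PDFBox API; the class and method names
(GibberishCheck, suspiciousRatio, acceptInStrictMode) and the threshold are
hypothetical. It assumes the text has already been extracted and merely
measures how many code points are replacement characters, private-use,
unassigned, or non-whitespace control characters, which is typical output when
mappings are missing or broken.

```java
// Hypothetical helper, not part of PDFBox: a post-extraction sanity check.
final class GibberishCheck {

    /** Returns the fraction of code points that rarely occur in real text. */
    static double suspiciousRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            int type = Character.getType(cp);
            boolean bad = cp == 0xFFFD                      // replacement character
                    || type == Character.PRIVATE_USE        // e.g. U+E000..U+F8FF
                    || type == Character.UNASSIGNED
                    || (type == Character.CONTROL && !Character.isWhitespace(cp));
            if (bad) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    /** Assumed "strict" policy: reject text whose suspicious share exceeds maxRatio. */
    static boolean acceptInStrictMode(String text, double maxRatio) {
        return suspiciousRatio(text) <= maxRatio;
    }
}
```

A check like this would only flag output that is visibly broken. Text from an
obfuscated PDF with misleading but well-formed mappings consists of ordinary
letters and would pass unnoticed.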
> Text extraction is gibberish
> ----------------------------
>
> Key: PDFBOX-4737
> URL: https://issues.apache.org/jira/browse/PDFBOX-4737
> Project: PDFBox
> Issue Type: Improvement
> Affects Versions: 2.0.18
> Reporter: Jorge Spinsanti
> Priority: Major
> Attachments: noUnicodeMapping.pdf, obfuscateTest_Duplicate_2_3.pdf
>
>
> As discussed in https://issues.apache.org/jira/browse/PDFBOX-4549,
> there are many PDFs for which text extraction yields gibberish.
> Perhaps you could add two modes (strict/lax) to text extraction, so that
> gibberish is not returned when it is not useful. I have attached a file for
> analyzing the problem.
> [^noUnicodeMapping.pdf]