[ https://issues.apache.org/jira/browse/PDFBOX-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr updated PDFBOX-4737:
------------------------------------
    Description:
As it was discussed on PDFBOX-4549 there are many PDFs where the text extraction is gibberish.
Perhaps you can add two modes (strict/lax) to text extraction to avoid gibberish if not useful.
Add a file to analyze the problem.
[^noUnicodeMapping.pdf]

  was:
As it was discussed on https://issues.apache.org/jira/browse/PDFBOX-4549 there are many PDFs where the text extraction is gibberish.
Perhaps you can add two modes (strict/lax) to text extraction to avoid gibberish if not useful.
Add a file to analyze the problem.
[^noUnicodeMapping.pdf]

> Text extraction is gibberish
> ----------------------------
>
>                 Key: PDFBOX-4737
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4737
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.18
>            Reporter: Jorge Spinsanti
>            Priority: Major
>         Attachments: noUnicodeMapping.pdf, obfuscateTest_Duplicate_2_3.pdf
>
>
> As it was discussed on PDFBOX-4549 there are many PDFs where the text
> extraction is gibberish.
> Perhaps you can add two modes (strict/lax) to text extraction to avoid
> gibberish if not useful. Add a file to analyze the problem.
> [^noUnicodeMapping.pdf]