[
https://issues.apache.org/jira/browse/PDFBOX-5857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr updated PDFBOX-5857:
------------------------------------
Attachment: screenshot-1.png
> PDFTextStripper returns messed up data
> ---------------------------------------
>
> Key: PDFBOX-5857
> URL: https://issues.apache.org/jira/browse/PDFBOX-5857
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 3.0.2 PDFBox
> Reporter: arjunce
> Priority: Minor
> Attachments: extractedText.txt, jumbledtext.pdf, screenshot-1.png
>
>
> I have attached the input PDF and its extracted text output for you to
> look at. I am using PDFTextStripper with these settings:
> {code:java}
> super();
> this.setSortByPosition(true);
> this.setWordSeparator("_word_");
> {code}
> Because I am sorting by position, the extracted text comes out jumbled. Is
> there a way for me to detect this case instead of outputting the jumbled
> text? Any help is appreciated. Thanks.
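
One possible way to approach the detection question, offered only as a rough sketch and not as anything PDFBox provides or the issue confirms: extract the text twice, once with setSortByPosition(true) and once with setSortByPosition(false), and measure how much the word order differs between the two results. The class name OrderDivergence, the method names, and any cut-off value applied to the score are hypothetical. The code uses only the Java standard library and assumes the two extracted strings are already in hand.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: compares the word order of two extractions of the same page.
final class OrderDivergence {

    // Split extracted text into words on line breaks and on the word separator.
    static List<String> words(String text, String separator) {
        String regex = "\\R|" + Pattern.quote(separator);
        return Arrays.stream(text.split(regex))
                .map(String::trim)
                .filter(w -> !w.isEmpty())
                .collect(Collectors.toList());
    }

    // Fraction of adjacent word pairs in `sorted` that are also adjacent in `unsorted`.
    // 1.0 means every neighbouring pair survived; values near 0 mean heavy reordering.
    static double adjacencyOverlap(String sorted, String unsorted, String separator) {
        List<String> s = words(sorted, separator);
        List<String> u = words(unsorted, separator);
        if (s.size() < 2) {
            return 1.0; // too few words to compare; treated as "no divergence" by assumption
        }
        // Words contain no line breaks after splitting, so '\n' is a safe pair delimiter.
        Set<String> unsortedPairs = new HashSet<>();
        for (int i = 0; i + 1 < u.size(); i++) {
            unsortedPairs.add(u.get(i) + '\n' + u.get(i + 1));
        }
        int shared = 0;
        for (int i = 0; i + 1 < s.size(); i++) {
            if (unsortedPairs.contains(s.get(i) + '\n' + s.get(i + 1))) {
                shared++;
            }
        }
        return (double) shared / (s.size() - 1);
    }
}
```

A low score only says that the two orderings disagree; it does not say which one is correct, so any threshold would need to be tuned against real documents such as the attached jumbledtext.pdf.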