arjunce created PDFBOX-5857:
-------------------------------
Summary: PDFTextStripper returns messed up data
Key: PDFBOX-5857
URL: https://issues.apache.org/jira/browse/PDFBOX-5857
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 3.0.2 PDFBox
Reporter: arjunce
Attachments: extractedText.txt, jumbledtext.pdf
I have attached the input PDF (jumbledtext.pdf) and the text extracted from it
(extractedText.txt). I am using PDFTextStripper with these settings:
{code:java}
super();
this.setSortByPosition(true);
this.setWordSeparator("_word_");
{code}
Because sort by position is enabled, the extracted text comes out jumbled. Is
there a way to detect this so that I can avoid outputting the jumbled text? Any
help is appreciated. Thanks.
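
One possible way to flag such output is sketched below. It is a rough heuristic,
not a PDFBox feature. It assumes the text is extracted twice with getText, once
with setSortByPosition(true) and once with setSortByPosition(false), and that
both results are passed in as plain strings. The class name ReorderCheck, both
method names, and any cut-off applied to the score are hypothetical. The score
is the fraction of adjacent word pairs in the sorted output that are also
adjacent in the unsorted output; a low value suggests that position sorting
reordered the text heavily.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper; it works on plain strings and calls no PDFBox API.
class ReorderCheck {

    // Split extracted text into words on the custom separator or on whitespace.
    static List<String> words(String text, String separator) {
        String regex = Pattern.quote(separator) + "|\\s+";
        return Arrays.stream(text.split(regex))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.toList());
    }

    // Fraction of adjacent word pairs in 'sorted' that are also adjacent in 'unsorted'.
    // Returns 1.0 when 'sorted' has fewer than two words (nothing to compare).
    static double adjacencyOverlap(List<String> sorted, List<String> unsorted) {
        if (sorted.size() < 2) {
            return 1.0;
        }
        // Words contain no whitespace after splitting, so "\n" is a safe joiner.
        Set<String> pairs = new HashSet<>();
        for (int i = 0; i + 1 < unsorted.size(); i++) {
            pairs.add(unsorted.get(i) + "\n" + unsorted.get(i + 1));
        }
        int kept = 0;
        for (int i = 0; i + 1 < sorted.size(); i++) {
            if (pairs.contains(sorted.get(i) + "\n" + sorted.get(i + 1))) {
                kept++;
            }
        }
        return (double) kept / (sorted.size() - 1);
    }
}
```

A caller could compare the score against a threshold chosen by experiment and
fall back to the unsorted text when it is low. The score only shows that the two
passes disagree, not which of the two is in the correct reading order.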
--
This message was sent by Atlassian Jira
(v8.20.10#820010)