arjunce created PDFBOX-5828:
-------------------------------
Summary: PDFTextStripper created garbled text
Key: PDFBOX-5828
URL: https://issues.apache.org/jira/browse/PDFBOX-5828
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 3.0.1 PDFBox
Reporter: arjunce
Attachments: output_text_stripper.txt, test.pdf
Hello folks,
I am using PDFBox to extract and manipulate the text content of a PDF, with
PDFTextStripper doing the extraction. I also set the options below:
{code:java}
PDDocument document = Loader.loadPDF(new File("src/test/resources/pdf/test.pdf"));
PDFTextStripper textStripper = new PDFTextStripper();
textStripper.setSortByPosition(true);
textStripper.setWordSeparator(" ");
{code}
When the positions are sorted, the exception "{*}Comparison method violates its
general contract!{*}" occurs.
The exception is possibly due to the float values. The custom merge sort
implementation scrambles the text at the individual character level, and I
cannot repair the text afterwards.
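
As an illustration of how float coordinates could lead to this exception, here is a minimal sketch. It assumes a comparator that treats two y values as equal when they are within a tolerance; the class name, the tolerance value and the comparator are hypothetical and are not taken from the PDFBox sources.

```java
import java.util.Comparator;

public class ToleranceComparatorDemo {
    // Hypothetical tolerance, chosen only for this illustration.
    static final float TOLERANCE = 1.0f;

    // Treats two y values as "the same line" when they differ by less than TOLERANCE.
    static final Comparator<Float> FUZZY_Y = (a, b) -> {
        if (Math.abs(a - b) < TOLERANCE) {
            return 0;
        }
        return Float.compare(a, b);
    };

    // True when cmp says a == b and b == c but a != c, which breaks transitivity.
    static boolean violatesTransitivity(Comparator<Float> cmp, float a, float b, float c) {
        return cmp.compare(a, b) == 0
                && cmp.compare(b, c) == 0
                && cmp.compare(a, c) != 0;
    }

    public static void main(String[] args) {
        // 10.0 ~ 10.6 and 10.6 ~ 11.2 under the tolerance, yet 10.0 < 11.2.
        System.out.println(violatesTransitivity(FUZZY_Y, 10.0f, 10.6f, 11.2f)); // prints true
    }
}
```

A comparator with this property is not a consistent ordering, so a sort that relies on it may either detect the inconsistency and throw, or silently produce an arbitrary order.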
I have attached the sample PDF and its text output. The merge sort does not
take the y and x coordinates into account when ordering the letters. Comparing
both coordinates during the sort would fix this issue.
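
A rough sketch of the ordering I have in mind, assuming y grows down the page. The Glyph class is a hypothetical stand-in for a text position, and the comparator is only an illustration, not the PDFBox implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class GlyphOrderSketch {
    // Hypothetical stand-in for a positioned character.
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Orders by y first (top to bottom), then by x (left to right).
    // Exact comparisons on both keys keep the ordering transitive.
    static final Comparator<Glyph> BY_Y_THEN_X =
            Comparator.comparingDouble((Glyph g) -> g.y)
                      .thenComparingDouble(g -> g.x);

    // Sorts a copy of the glyphs and concatenates their text.
    static String sortAndJoin(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(BY_Y_THEN_X);
        StringBuilder sb = new StringBuilder();
        for (Glyph g : sorted) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("b", 20f, 100f),
                new Glyph("c", 10f, 120f),
                new Glyph("a", 10f, 100f));
        System.out.println(sortAndJoin(glyphs)); // prints abc
    }
}
```

Exact y comparison would split a line whose characters sit on slightly different baselines, so a real fix would probably need to group characters into lines first, for example by rounding y to a line bucket before comparing, which keeps the ordering consistent.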