arjunce created PDFBOX-5828:
-------------------------------

             Summary: PDFTextStripper created garbled text
                 Key: PDFBOX-5828
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5828
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 3.0.1 PDFBox
            Reporter: arjunce
         Attachments: output_text_stripper.txt, test.pdf

Hello Folks, 

I am using PDFBox to extract and manipulate the text content of a PDF. I 
extract the text with PDFTextStripper, configured as follows:
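{code:java}
import java.io.File;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

PDDocument document = Loader.loadPDF(new File("src/test/resources/pdf/test.pdf"));
PDFTextStripper textStripper = new PDFTextStripper();
textStripper.setSortByPosition(true); // order text by page position
textStripper.setWordSeparator(" ");
String text = textStripper.getText(document); // the extraction step
{code}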
With position sorting enabled, the sort fails with a "{*}Comparison method 
violates its general contract!{*}" exception.

The exception is possibly due to the float values. The custom merge sort 
implementation scrambles the text at the individual character level, and I 
can't repair the text afterwards. 
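
To illustrate how float values could trigger that exception, here is a minimal sketch. It assumes the comparator treats coordinates within some tolerance as equal, which I have not confirmed against the PDFBox source; the class name, the field names and the TOLERANCE value are all hypothetical.

```java
import java.util.Comparator;

// Hypothetical illustration only, not PDFBox code.
class ToleranceComparatorDemo {
    // Assumed tolerance, chosen for the example.
    static final float TOLERANCE = 1.0f;

    // Treats two y coordinates as equal when they differ by less than TOLERANCE.
    static final Comparator<Float> BY_Y_WITH_TOLERANCE = (a, b) -> {
        if (Math.abs(a - b) < TOLERANCE) {
            return 0;
        }
        return Float.compare(a, b);
    };

    // True when cmp says a == b and b == c but a != c, which breaks the
    // transitivity that the Comparator contract requires.
    static boolean violatesTransitivity(Comparator<Float> cmp, float a, float b, float c) {
        return cmp.compare(a, b) == 0
                && cmp.compare(b, c) == 0
                && cmp.compare(a, c) != 0;
    }

    public static void main(String[] args) {
        // 10.0 ~ 10.6 and 10.6 ~ 11.2, but 10.0 and 11.2 differ by more than 1.0.
        System.out.println(violatesTransitivity(BY_Y_WITH_TOLERANCE, 10.0f, 10.6f, 11.2f));
    }
}
```

A comparator like this can make java.util's sort throw IllegalArgumentException with the message quoted above, because the sort relies on a consistent ordering.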

I have attached the sample PDF and its text output. The merge sort doesn't 
consider the y and x coordinates when sorting the letters. Taking both into 
account while sorting would fix this issue.
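
As a rough sketch of the ordering I have in mind (not a patch against PDFBox): compare by y first, then by x, using exact comparisons so the order stays consistent. The Glyph class and all names below are hypothetical stand-ins, and I am assuming y grows downward on the page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch only, not PDFBox's own comparator.
class GlyphOrderSketch {
    // Minimal stand-in for a positioned character.
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Consistent order: y first (top to bottom, assuming y grows downward),
    // then x (left to right). Exact comparisons keep the order transitive.
    static final Comparator<Glyph> BY_Y_THEN_X =
            Comparator.<Glyph>comparingDouble(g -> g.y).thenComparingDouble(g -> g.x);

    // Sorts a copy of the glyphs and concatenates their text in reading order.
    static String readInOrder(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(BY_Y_THEN_X);
        StringBuilder sb = new StringBuilder();
        for (Glyph g : sorted) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("b", 20f, 10f), new Glyph("a", 10f, 10f),
                new Glyph("d", 20f, 20f), new Glyph("c", 10f, 20f));
        System.out.println(readInOrder(glyphs)); // prints "abcd"
    }
}
```

Exact comparison on y would split characters that sit on the same visual line with slightly different baselines, so a real fix would probably need to group characters into lines first and only then sort by x within each line.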


