[
https://issues.apache.org/jira/browse/PDFBOX-5828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
arjunce closed PDFBOX-5828.
---------------------------
Resolution: Invalid
> PDFTextStripper created garbled text
> ------------------------------------
>
> Key: PDFBOX-5828
> URL: https://issues.apache.org/jira/browse/PDFBOX-5828
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 3.0.1 PDFBox, 3.0.2 PDFBox
> Reporter: arjunce
> Priority: Trivial
> Attachments: image-2024-05-24-11-59-31-786.png,
> output_text_stripper.txt, screenshot-1.png, test.pdf
>
>
> Hello folks,
> I am using PDFBox to extract and manipulate the text content of a PDF, with
> PDFTextStripper doing the extraction. I set the following options:
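> {code:java}
> import java.io.File;
> import org.apache.pdfbox.Loader;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.text.PDFTextStripper;
>
> PDDocument document = Loader.loadPDF(new File("src/test/resources/pdf/test.pdf"));
> PDFTextStripper textStripper = new PDFTextStripper();
> textStripper.setSortByPosition(true); // order glyphs by page position before output
> textStripper.setWordSeparator(" ");
> {code}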
> The TextPositionComparator is not transitive, as mentioned in a comment. The
> custom merge sort scrambles the text at the level of individual characters,
> and I cannot repair the text afterwards.
> I have attached the sample PDF and its text output. The merge sort does not
> consider the x and y coordinates when sorting the letters. Taking them into
> account during the sort would fix this issue.
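
To make the transitivity point in the report concrete, here is a minimal, self-contained sketch. It does not use PDFBox: `Glyph`, `TOLERANCE`, `TOLERANT`, and `BUCKETED` are hypothetical names invented for illustration, and the tolerance rule is only a simplified stand-in for a position comparator, not the library's actual logic. It shows how "same line if the y values are close, then compare x" can produce a cycle, and how snapping y to fixed rows before comparing gives a consistent ordering.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ComparatorSketch {
    // Toy positioned glyph; NOT PDFBox's TextPosition.
    static final class Glyph {
        final String text;
        final float x;
        final float y;
        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Assumed tolerance: glyphs whose y differs by less than this count as one line.
    static final float TOLERANCE = 2.0f;

    // Tolerance-based ordering: same line -> compare x, otherwise compare y.
    // "Same line" is not an equivalence relation here, so cycles are possible.
    static final Comparator<Glyph> TOLERANT = (a, b) -> {
        if (Math.abs(a.y - b.y) < TOLERANCE) {
            return Float.compare(a.x, b.x);
        }
        return Float.compare(a.y, b.y);
    };

    // Transitive alternative: snap y to a fixed row index, then compare (row, x).
    static final Comparator<Glyph> BUCKETED =
            Comparator.<Glyph>comparingInt(g -> (int) Math.floor(g.y / TOLERANCE))
                      .thenComparingDouble(g -> g.x);

    // True if some triple satisfies a < b and b < c but not a < c.
    static boolean hasCycle(Comparator<Glyph> cmp, List<Glyph> glyphs) {
        for (Glyph a : glyphs) {
            for (Glyph b : glyphs) {
                for (Glyph c : glyphs) {
                    if (cmp.compare(a, b) < 0 && cmp.compare(b, c) < 0
                            && cmp.compare(a, c) >= 0) {
                        return true;
                    }
                }
            }
        }
        return false;
    }

    // Three glyphs on a slight diagonal: neighbours are within tolerance, the ends are not.
    static List<Glyph> sample() {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("A", 30f, 0.0f));
        glyphs.add(new Glyph("B", 20f, 1.5f));
        glyphs.add(new Glyph("C", 10f, 3.0f));
        return glyphs;
    }

    // Sorts a copy of the glyphs and concatenates their text.
    static String sortedText(Comparator<Glyph> cmp, List<Glyph> glyphs) {
        List<Glyph> copy = new ArrayList<>(glyphs);
        copy.sort(cmp);
        StringBuilder sb = new StringBuilder();
        for (Glyph g : copy) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hasCycle(TOLERANT, sample()));   // prints "true"
        System.out.println(hasCycle(BUCKETED, sample()));   // prints "false"
        System.out.println(sortedText(BUCKETED, sample())); // prints "BAC"
    }
}
```

With the tolerant rule, C sorts before B, B before A, and A before C, so no single order satisfies all three comparisons and the result depends on the sort algorithm and the input order. The bucketed variant is consistent, but it can split glyphs that straddle a row boundary, so it is a trade-off rather than a drop-in fix.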
--
This message was sent by Atlassian Jira
(v8.20.10#820010)