[jira] [Closed] (PDFBOX-5828) PDFTextStripper created garbled text

arjunce (Jira) Fri, 24 May 2024 03:02:12 -0700


     [ 
https://issues.apache.org/jira/browse/PDFBOX-5828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


arjunce closed PDFBOX-5828.
---------------------------
    Resolution: Invalid

> PDFTextStripper created garbled text
> ------------------------------------
>
>                 Key: PDFBOX-5828
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5828
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 3.0.1 PDFBox, 3.0.2 PDFBox
>            Reporter: arjunce
>            Priority: Trivial
>         Attachments: image-2024-05-24-11-59-31-786.png, 
> output_text_stripper.txt, screenshot-1.png, test.pdf
>
>
> Hello folks,
> I am using PDFBox to extract and manipulate the text content of a PDF, with
> PDFTextStripper doing the text extraction. I also set the following options:
> {code:java}
> PDDocument document = Loader.loadPDF(new File("src/test/resources/pdf/test.pdf"));
> PDFTextStripper textStripper = new PDFTextStripper();
> textStripper.setSortByPosition(true);
> textStripper.setWordSeparator(" ");
> {code}
> As a comment in the source notes, TextPositionComparator is not transitive.
> The custom merge sort built around it scrambles the text at the
> individual-character level, and I cannot repair the text afterwards.
> I have attached the sample PDF and its text output. The merge sort does not
> consider the y and x coordinates when sorting the letters. Taking both into
> account while sorting would fix this issue.
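
The transitivity concern in the report can be illustrated without PDFBox. The sketch below is not PDFBox code and not a proposed patch (the issue was closed as Invalid). It assumes a hypothetical `Glyph` class with `x` and `y` fields, and compares a tolerance-based comparator, which can produce a cycle, with one that snaps `y` to a line index before comparing `x`. All names and the line-height value are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ReadingOrderSketch {

    /** Hypothetical positioned glyph; a stand-in, not a PDFBox type. */
    static final class Glyph {
        final String text;
        final double x;
        final double y; // assumed to grow from the top of the page downward

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Treats glyphs whose y values differ by less than the tolerance as being
     * on the same line and orders them by x; otherwise orders by y.
     * "Close enough" is not an equivalence relation, so this can be non-transitive.
     */
    static Comparator<Glyph> toleranceComparator(double tolerance) {
        return (a, b) -> Math.abs(a.y - b.y) < tolerance
                ? Double.compare(a.x, b.x)
                : Double.compare(a.y, b.y);
    }

    /**
     * Snaps y to an integer line index first, then compares x.
     * Comparing derived keys lexicographically is always transitive.
     */
    static Comparator<Glyph> bucketedComparator(double lineHeight) {
        return Comparator
                .comparingLong((Glyph g) -> Math.round(g.y / lineHeight))
                .thenComparingDouble(g -> g.x);
    }

    /** Sorts a copy of the glyphs with the bucketed comparator and joins their text. */
    static String sortReadingOrder(List<Glyph> glyphs, double lineHeight) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(bucketedComparator(lineHeight));
        StringBuilder out = new StringBuilder();
        for (Glyph g : sorted) {
            out.append(g.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Three glyphs on a slight slope: each neighbour pair is within tolerance.
        Comparator<Glyph> fuzzy = toleranceComparator(1.0);
        Glyph a = new Glyph("a", 30, 0.0);
        Glyph b = new Glyph("b", 20, 0.6);
        Glyph c = new Glyph("c", 10, 1.2);
        // a sorts after b, b sorts after c, yet a sorts before c: a cycle.
        System.out.println(fuzzy.compare(a, b) > 0);
        System.out.println(fuzzy.compare(b, c) > 0);
        System.out.println(fuzzy.compare(a, c) < 0);

        // Two lines of two glyphs each, given out of order.
        List<Glyph> glyphs = List.of(
                new Glyph("b", 20, 10.2),
                new Glyph("c", 10, 22.0),
                new Glyph("a", 10, 9.9),
                new Glyph("d", 20, 21.8));
        System.out.println(sortReadingOrder(glyphs, 12.0));
    }
}
```

Bucketing trades one problem for another: glyphs that sit near a bucket boundary can land on different line indices even though they belong to the same visual line, so the line height has to be chosen with the document's layout in mind.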



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

