[jira] [Comment Edited] (PDFBOX-5828) PDFTextStripper created garbled text

Tilman Hausherr (Jira) Fri, 24 May 2024 03:01:06 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849237#comment-17849237
 ]


Tilman Hausherr edited comment on PDFBOX-5828 at 5/24/24 9:59 AM:
------------------------------------------------------------------

The sort mode usually works. The problem with this file is that the font has 
huge Ascent and CapHeight values, 
 !image-2024-05-24-11-59-31-786.png! 
so PDFBox thinks that these glyphs are all on the same line. A look at 
{{TextPositionComparator}} shows that the y coordinate is considered when 
sorting.
 !screenshot-1.png! 
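
To illustrate the effect described above, here is a minimal sketch in plain Java. It does not use PDFBox: the {{Glyph}} class, the {{sameLine}} rule and the sample coordinates are assumptions made up for illustration, only loosely modeled on a position comparator that treats glyphs with overlapping vertical extents as being on one line. The real {{TextPositionComparator}} may differ in its details.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SameLineSketch {
    /** Hypothetical stand-in for a positioned glyph; not a PDFBox class. */
    static final class Glyph {
        final String text;
        final float x;        // left edge of the glyph
        final float yBottom;  // baseline, measured downward from the page top
        final float height;   // height derived from the font metrics

        Glyph(String text, float x, float yBottom, float height) {
            this.text = text;
            this.x = x;
            this.yBottom = yBottom;
            this.height = height;
        }
    }

    /** Assumed rule: same line if either baseline lies inside the other glyph's vertical extent. */
    static boolean sameLine(Glyph a, Glyph b) {
        float aTop = a.yBottom - a.height;
        float bTop = b.yBottom - b.height;
        return Math.abs(a.yBottom - b.yBottom) < 0.1f
                || (b.yBottom >= aTop && b.yBottom <= a.yBottom)
                || (a.yBottom >= bTop && a.yBottom <= b.yBottom);
    }

    /** Same line: order by x. Different lines: order by y. */
    static final Comparator<Glyph> BY_POSITION = (a, b) ->
            sameLine(a, b) ? Float.compare(a.x, b.x) : Float.compare(a.yBottom, b.yBottom);

    /** Sorts two lines of text ("AB" above "CD") whose glyphs all report the given height. */
    public static String sortAndJoin(float height) {
        List<Glyph> glyphs = new ArrayList<>(Arrays.asList(
                new Glyph("A", 10, 100, height),
                new Glyph("B", 20, 100, height),
                new Glyph("C", 10, 120, height),
                new Glyph("D", 20, 120, height)));
        glyphs.sort(BY_POSITION);
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(sortAndJoin(10f));   // plausible height: prints ABCD
        System.out.println(sortAndJoin(500f));  // inflated height: prints ACBD
    }
}
```

With the inflated height every glyph overlaps every other one vertically, so the order is decided by x alone and the two lines are interleaved.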

You added the "easyfix" label. Feel free to attach a patch and I'll test it.

Your file also has another problem: one of the font dictionaries points to an 
incorrect object. This is likely because of a nasty PDFBox bug that was fixed 
in 3.0.2 or will be fixed in 3.0.3. I assume you ran a split on the original 
file.


> PDFTextStripper created garbled text
> ------------------------------------
>
>                 Key: PDFBOX-5828
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5828
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 3.0.1 PDFBox, 3.0.2 PDFBox
>            Reporter: arjunce
>            Priority: Trivial
>         Attachments: image-2024-05-24-11-59-31-786.png, 
> output_text_stripper.txt, screenshot-1.png, test.pdf
>
>
> Hello Folks, 
> I am using PDFBox to extract and manipulate the text content of a PDF, and 
> PDFTextStripper to extract the text. I am also setting the options below:
> {code:java}
> PDDocument document = Loader.loadPDF(new File("src/test/resources/pdf/test.pdf"));
> PDFTextStripper textStripper = new PDFTextStripper();
> textStripper.setSortByPosition(true);
> textStripper.setWordSeparator(" ");
> {code}
> The TextPositionComparator is not transitive, as mentioned in a comment. The 
> custom merge sort scrambles the text at the individual character level, and 
> I can't repair the text afterwards. 
> I have attached the sample PDF and its text output. The merge sort doesn't 
> consider the y and x coordinates when sorting the letters. Taking them into 
> account while sorting would fix this issue.
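
As a rough illustration of what "not transitive" means for a position comparator, the following plain-Java sketch builds three glyphs whose pairwise ordering forms a cycle. The {{Glyph}} class, the overlap rule and the coordinates are hypothetical and only approximate the idea of a "same line" test; this is not PDFBox's actual code.

```java
import java.util.Comparator;

class TransitivitySketch {
    /** Hypothetical stand-in for a positioned glyph; not a PDFBox class. */
    public static final class Glyph {
        final float x;        // left edge of the glyph
        final float yBottom;  // baseline, measured downward from the page top
        final float height;   // glyph height

        Glyph(float x, float yBottom, float height) {
            this.x = x;
            this.yBottom = yBottom;
            this.height = height;
        }
    }

    /** Assumed rule: same line if either baseline lies inside the other glyph's vertical extent. */
    static boolean sameLine(Glyph a, Glyph b) {
        float aTop = a.yBottom - a.height;
        float bTop = b.yBottom - b.height;
        return Math.abs(a.yBottom - b.yBottom) < 0.1f
                || (b.yBottom >= aTop && b.yBottom <= a.yBottom)
                || (a.yBottom >= bTop && a.yBottom <= b.yBottom);
    }

    /** Same line: order by x. Different lines: order by y. */
    static final Comparator<Glyph> BY_POSITION = (a, b) ->
            sameLine(a, b) ? Float.compare(a.x, b.x) : Float.compare(a.yBottom, b.yBottom);

    // Three glyphs that step slightly downward and to the left.
    public static final Glyph A = new Glyph(30, 100, 10); // vertical extent 90..100
    public static final Glyph B = new Glyph(20, 108, 10); // vertical extent 98..108, overlaps A
    public static final Glyph C = new Glyph(10, 116, 10); // vertical extent 106..116, overlaps B only

    /** Returns -1, 0 or 1 for the comparison of the two glyphs. */
    public static int sign(Glyph a, Glyph b) {
        return Integer.signum(BY_POSITION.compare(a, b));
    }

    public static void main(String[] args) {
        System.out.println("C vs B: " + sign(C, B)); // -1: same line, C is further left
        System.out.println("B vs A: " + sign(B, A)); // -1: same line, B is further left
        System.out.println("A vs C: " + sign(A, C)); // -1: different lines, A is higher up
    }
}
```

C sorts before B and B before A, yet A sorts before C, so no consistent total order exists for these three glyphs. A sort that relies on the {{Comparator}} contract can return different results depending on the input order in such a case.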


