[ https://issues.apache.org/jira/browse/PDFBOX-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18010686#comment-18010686 ]
Tilman Hausherr commented on PDFBOX-6046:
-----------------------------------------

I've added the option to PDFDebugger in PDFBOX-6047 so you can play with it to see the difference.

> PDFTextStripper: Sorting issue with overlaying text
> ---------------------------------------------------
>
>                 Key: PDFBOX-6046
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6046
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Oliver Schmidtmer
>            Priority: Major
>         Attachments: 10600601393673.ANF - 20.03.2025, 08_57_48.pdf,
>                      PDFBOX-6046-reduced.pdf, image-2025-07-28-20-24-32-787.png
>
>
> We found an issue with PDFTextStripper when text is "layered", in this
> case with some spaces serving as placeholders.
> The PDFs in question are templates for orders, which are filled with data in
> a second step.
> So if the text is ordered by occurrence in the PDF source, the first half
> consists of the field labels and the second half of the field values. We
> therefore need sorting by rendered position with
> PDFTextStripper#setSortByPosition(true).
> In the first example in the file, what should be
> "Auftraggeber: NAGEL-GROUP"
> is extracted as
> "Auftraggeber: N AGEL-GROUP", with a space after the "N".
> !image-2025-07-28-20-24-32-787.png|width=440,height=62!
> This is caused by the placeholder spaces after "Auftraggeber: " in the
> template, which overlap with the first glyph of the field value.
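
The effect described in the report can be reproduced with a toy model. The sketch below is not PDFBox's sorting code: the `Glyph` class, the coordinates, and the chunking of the text into five pieces are all assumptions made for illustration. It only shows how sorting purely by x position lets a placeholder space that overlaps the first glyph of the value end up after that glyph.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class OverlapSortSketch {
    // Hypothetical stand-in for a positioned piece of text; not PDFBox's TextPosition.
    static final class Glyph {
        final String text;
        final double x; // assumed left edge on the line, in arbitrary units

        Glyph(String text, double x) {
            this.text = text;
            this.x = x;
        }
    }

    // Concatenate glyphs in the order given (models content-stream order).
    static String concat(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    // Sort a copy of the glyphs by x coordinate only, then concatenate.
    static String sortByX(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        return concat(sorted);
    }

    // Template text first (label plus two placeholder spaces), then the value.
    // The second placeholder space (x = 66) lies inside the area covered by "N" (x = 64).
    static List<Glyph> sample() {
        return Arrays.asList(
            new Glyph("Auftraggeber:", 0),
            new Glyph(" ", 60),
            new Glyph(" ", 66),
            new Glyph("N", 64),
            new Glyph("AGEL-GROUP", 70));
    }

    public static void main(String[] args) {
        System.out.println("[" + concat(sample()) + "]");  // content-stream order
        System.out.println("[" + sortByX(sample()) + "]"); // position order
    }
}
```

In this model the stream order keeps "NAGEL-GROUP" intact, while the position sort places the overlapping space between "N" and "AGEL-GROUP". A real fix would have to decide what to do with glyphs whose areas overlap; how PDFBox handles that is not stated in this message.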