[ https://issues.apache.org/jira/browse/PDFBOX-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18010489#comment-18010489 ]
Oliver Schmidtmer commented on PDFBOX-6046:
-------------------------------------------

I would like to try changing PDFTextStripper so that it first groups the glyphs into words and then sorts by the bounds of the words, instead of sorting by glyph bounds. AFAIK the glyphs belonging to a word should usually already be in the correct order before sorting, or is there something I should be aware of?

Sorting like this might also help with files like the one in PDFBOX-5828.

> PDFTextStripper: Sorting issue with overlaying text
> ---------------------------------------------------
>
>                 Key: PDFBOX-6046
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6046
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Oliver Schmidtmer
>            Priority: Major
>         Attachments: 10600601393673.ANF - 20.03.2025, 08_57_48.pdf,
> image-2025-07-28-20-24-32-787.png
>
>
> We found an issue with PDFTextStripper when text is "layered", in this
> case with some spaces as placeholders.
> The PDFs in question are templates for orders, which are filled with data in
> a second step.
> So if the text is ordered by occurrence in the PDF source, the first half
> consists of the field labels and the second half of the field values. We
> therefore need sorting by rendered position with
> PDFTextStripper#setSortByPosition(true).
> Now, as the first example in the file, what should be
> "Auftraggeber: NAGEL-GROUP"
> is extracted as
> "Auftraggeber: N AGEL-GROUP", with a space.
> !image-2025-07-28-20-24-32-787.png|width=440,height=62!
> This is caused by the spaces after "Auftraggeber: " that serve as a
> placeholder in the template and overlap with the first glyph of the field
> value.
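
To make the idea from the comment more concrete (group glyphs into words in content-stream order first, then sort the words by position), here is a minimal, self-contained sketch. It does not use the PDFBox API and is not the actual PDFTextStripper change: Glyph, Word, run, groupIntoWords, extract and the tolerance parameter are all hypothetical names made up for illustration. It also assumes that y grows downwards, that whitespace glyphs only act as word separators, and that glyphs of one word directly follow each other on the same baseline.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class WordSortSketch {

    /** Hypothetical glyph: text, left edge, baseline and advance width. */
    record Glyph(String text, float x, float y, float width) {
        float right() { return x + width; }
    }

    /** Hypothetical word: its text plus the position of its first glyph. */
    record Word(String text, float x, float y) {}

    /** Helper for the example: lays out a string with a fixed glyph width. */
    static List<Glyph> run(String s, float x, float y, float w) {
        List<Glyph> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            out.add(new Glyph(String.valueOf(s.charAt(i)), x + i * w, y, w));
        }
        return out;
    }

    /** Groups glyphs into words while keeping the content-stream order. */
    static List<Word> groupIntoWords(List<Glyph> glyphs, float tol) {
        List<Word> words = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        float startX = 0, baseline = 0;
        Glyph prev = null;
        for (Glyph g : glyphs) {
            boolean blank = g.text().isBlank();
            // A glyph continues the current word only if it is not whitespace
            // and sits right after the previous glyph on the same baseline.
            boolean continues = prev != null && !blank
                    && Math.abs(g.y() - baseline) <= tol
                    && Math.abs(g.x() - prev.right()) <= tol;
            if (!continues && text.length() > 0) {
                words.add(new Word(text.toString(), startX, baseline));
                text.setLength(0);
            }
            if (blank) {          // placeholder spaces never join a word
                prev = null;
                continue;
            }
            if (text.length() == 0) {
                startX = g.x();
                baseline = g.y();
            }
            text.append(g.text());
            prev = g;
        }
        if (text.length() > 0) {
            words.add(new Word(text.toString(), startX, baseline));
        }
        return words;
    }

    /** Sorts whole words by line (top to bottom), then by x, and joins them. */
    static String extract(List<Glyph> glyphs, float tol) {
        List<Word> words = groupIntoWords(glyphs, tol);
        words.sort(Comparator.comparingInt((Word w) -> Math.round(w.y() / tol))
                .thenComparingDouble(w -> w.x()));
        StringBuilder out = new StringBuilder();
        Word prev = null;
        for (Word w : words) {
            if (prev != null) {
                boolean sameLine =
                        Math.round(prev.y() / tol) == Math.round(w.y() / tol);
                out.append(sameLine ? ' ' : '\n');
            }
            out.append(w.text());
            prev = w;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.addAll(run("Auftraggeber:", 0f, 100f, 5f)); // template label
        glyphs.addAll(run("   ", 65f, 100f, 5f));          // placeholder spaces
        glyphs.addAll(run("NAGEL-GROUP", 72f, 100f, 5f));  // value, overlaps spaces
        System.out.println(extract(glyphs, 1f)); // prints "Auftraggeber: NAGEL-GROUP"
    }
}
```

In this sketch the placeholder space at x=75 lies between "N" (x=72) and "A" (x=77), so a per-glyph sort by x would place it inside the value. Because the spaces only terminate a word and the sort then runs on whole words, the space cannot end up inside "NAGEL-GROUP". Bucketing lines with Math.round(y / tol) is a simplification; real documents would need a more robust line detection.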