[ 
https://issues.apache.org/jira/browse/PDFBOX-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18010686#comment-18010686
 ] 

Tilman Hausherr commented on PDFBOX-6046:
-----------------------------------------

I've added the option to PDFDebugger in PDFBOX-6047 so you can play with it to 
see the difference.

> PDFTextStripper: Sorting issue with overlaying text
> ---------------------------------------------------
>
>                 Key: PDFBOX-6046
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6046
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Oliver Schmidtmer
>            Priority: Major
>         Attachments: 10600601393673.ANF - 20.03.2025, 08_57_48.pdf, 
> PDFBOX-6046-reduced.pdf, image-2025-07-28-20-24-32-787.png
>
>
> We found an issue with PDFTextStripper when text is "layered", in this case 
> with some spaces used as placeholders.
> The PDFs in question are order templates, which are filled with data in 
> a second step.
> So if the text is ordered by occurrence in the PDF source, the first half 
> consists of the field labels and the second half of the field values. We 
> therefore need sorting by rendered position with 
> PDFTextStripper#setSortByPosition(true).
> Taking the first example in the file, what should be
> "Auftraggeber: NAGEL-GROUP"
> is extracted as
> "Auftraggeber: N AGEL-GROUP" with a space.
> !image-2025-07-28-20-24-32-787.png|width=440,height=62!
> This is caused by spaces after "Auftraggeber:  " as a placeholder in the 
> template, which overlap with the first glyph of the field value.
>  
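
To make the effect described above easier to picture, here is a minimal, 
self-contained sketch. It is only a toy model and not PDFBox's actual sorting 
code: the class OverlaySortSketch, the Glyph type, the fixed advance of 6 units 
and all coordinates are assumptions made up for illustration. It shows how 
ordering glyphs purely by x position can place a placeholder space between the 
first and second glyph of an overlapping field value.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class OverlaySortSketch {

    /** Hypothetical glyph model: a character plus the x coordinate of its left edge. */
    public static final class Glyph {
        final char ch;
        final double x;

        Glyph(char ch, double x) {
            this.ch = ch;
            this.x = x;
        }
    }

    /** Lays out a string starting at startX with a fixed advance per glyph. */
    public static List<Glyph> layout(String text, double startX, double advance) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            glyphs.add(new Glyph(text.charAt(i), startX + i * advance));
        }
        return glyphs;
    }

    /** Concatenates the glyphs, optionally ordering them by x position first. */
    public static String extract(List<Glyph> glyphs, boolean sortByPosition) {
        List<Glyph> ordered = new ArrayList<>(glyphs);
        if (sortByPosition) {
            // List.sort is stable, so glyphs with equal x keep their source order.
            ordered.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        }
        StringBuilder sb = new StringBuilder();
        for (Glyph g : ordered) {
            sb.append(g.ch);
        }
        return sb.toString();
    }

    /** One line of the template: label with placeholder spaces, then the value. */
    public static List<Glyph> sampleLine() {
        List<Glyph> line = new ArrayList<>();
        // Template text: the label followed by two placeholder spaces (x = 78 and x = 84).
        line.addAll(layout("Auftraggeber:  ", 0.0, 6.0));
        // Field value drawn in a second step; it starts at x = 81, overlapping the spaces.
        line.addAll(layout("NAGEL-GROUP", 81.0, 6.0));
        return line;
    }

    public static void main(String[] args) {
        System.out.println(extract(sampleLine(), false));
        System.out.println(extract(sampleLine(), true));
    }
}
```

In this model the second placeholder space (x = 84) falls between "N" (x = 81) 
and "A" (x = 87), so the position-sorted result reads 
"Auftraggeber: N AGEL-GROUP", matching the symptom in the report. The real 
extraction logic also takes glyph widths and line grouping into account, which 
this sketch deliberately leaves out.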



