[ https://issues.apache.org/jira/browse/PDFBOX-5411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888352#comment-17888352 ]
Michael Klink commented on PDFBOX-5411:
---------------------------------------

You guess right, with {{SortByPosition}} set to {{false}} text is extracted in the order it is drawn by the instructions in the content streams.

Concerning your question, therefore -
{quote}I wonder which corner cases were correctly detected before and would be no longer{quote}
- the cases that require sorting are those in which the text is _not_ drawn in reading order.

Theoretically text in PDFs can be drawn in any order, so the need to sort can arise for arbitrary PDFs. In real PDFs text often is drawn in reading order as that's quite a natural thing to do. But there are exceptions. And as programs usually cannot determine which PDFs draw the text in reading order and which don't, many of them sort always, just in case.

In particular if forms are prefilled (or filled and then flattened), you usually get content streams in which first all the labels and flavor texts are drawn and thereafter all the filled-in values. Sorting such PDFs allows for sensible text extraction.

> PDFTextStripper could use text size in reconstruction
> -----------------------------------------------------
>
>                 Key: PDFBOX-5411
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5411
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.25, 3.0.0 PDFBox
>            Reporter: Lapo Luchini
>            Priority: Minor
>         Attachments: image-2022-04-08-16-13-17-334.png, image-2022-04-15-09-26-20-917.png, textDoubleText.pdf
>
>
> When two texts are partially overlapping {{PDFTextStripper}} seems to return a mix simply based on "leftmost x coordinate of the glyph", which makes sense, but it could make use of glyph size to disambiguate "easy" cases like this one:
> !image-2022-04-08-16-13-17-334.png!
> currently this is the first parameter of PDFTextStripper.writeString(String string, List<TextPosition> textPositions):
> {{"T0510E09620_S368b3aT92-29fa -4Leef-80I5e-N53c23efE7979f"}}
> I would of course hope for two calls:
> {{"TEST LINE"}}
> {{"051009620_368b3a92-29fa-4eef-805e-53c23ef7979f"}}
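
To make the prefilled-form case from the comment concrete, here is a minimal, self-contained sketch. It does not use PDFBox and is not the algorithm {{PDFTextStripper}} implements; {{DrawOrderSketch}}, {{Fragment}} and {{samplePage}} are hypothetical names introduced only for illustration, and the coordinates are assumed to have their origin at the top-left of the page, with y growing downwards. The labels are "drawn" first and the filled-in values afterwards, so draw order and reading order differ until the fragments are sorted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustration only: a toy model of draw order versus position-sorted order. */
final class DrawOrderSketch {

    /** Hypothetical stand-in for a positioned piece of text (not a PDFBox type). */
    static final class Fragment {
        final String text;
        final double x; // assumed: distance from the left page edge
        final double y; // assumed: distance from the top page edge

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** A flattened form: all labels are drawn first, then all values. */
    static List<Fragment> samplePage() {
        return Arrays.asList(
                new Fragment("Name:", 50, 100),
                new Fragment("City:", 50, 120),
                new Fragment("Alice", 120, 100),
                new Fragment("Paris", 120, 120));
    }

    /** Joins the fragments in the given order, i.e. the order they were drawn. */
    static String inDrawOrder(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    /** Sorts a copy top-to-bottom, then left-to-right, and joins the result. */
    static String sortedByPosition(List<Fragment> fragments) {
        List<Fragment> copy = new ArrayList<>(fragments);
        copy.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return inDrawOrder(copy);
    }

    public static void main(String[] args) {
        List<Fragment> page = samplePage();
        System.out.println(inDrawOrder(page));      // Name: City: Alice Paris
        System.out.println(sortedByPosition(page)); // Name: Alice City: Paris
    }
}
```

In PDFBox itself the corresponding switch is {{PDFTextStripper.setSortByPosition(true)}}; the real sorting has to cope with overlapping glyphs, rotation and slightly differing baselines, none of which this sketch attempts.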