[ 
https://issues.apache.org/jira/browse/PDFBOX-5411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888352#comment-17888352
 ] 

Michael Klink commented on PDFBOX-5411:
---------------------------------------

You guessed right: with {{SortByPosition}} set to {{false}}, text is extracted
in the order in which it is drawn by the instructions in the content streams.
That also answers your question
{quote}I wonder which corner cases were correctly detected before and would be 
no longer{quote}
The cases that require sorting are those in which the text is _not_ drawn in
reading order. In theory, text in a PDF can be drawn in any order, so the need
to sort can arise for arbitrary PDFs. In real PDFs, text is often drawn in
reading order, as that's quite a natural thing to do, but there are exceptions.
And as programs usually cannot determine which PDFs draw their text in reading
order and which don't, many of them always sort, just in case.

In particular, if forms are prefilled (or filled in and then flattened), you
usually get content streams in which all the labels and flavor texts are drawn
first and all the filled-in values thereafter. Sorting the text of such PDFs
by position allows for sensible text extraction.
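
To illustrate the flattened-form case, here is a minimal sketch in plain Java.
It is not PDFBox code: the {{Chunk}} class, the sample coordinates and the
simple top-to-bottom, left-to-right comparator are assumptions made for this
example only. In PDFBox itself the switch is
{{PDFTextStripper.setSortByPosition(boolean)}}, and the actual sorting is more
involved than what is shown here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class SortByPositionSketch {

    /** Hypothetical text chunk; y is assumed to grow downwards on the page. */
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Content-stream order of a flattened form: all labels first, then all values. */
    static List<Chunk> flattenedForm() {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("Name:", 50, 100));
        chunks.add(new Chunk("City:", 50, 120));
        chunks.add(new Chunk("Alice", 120, 100));
        chunks.add(new Chunk("Berlin", 120, 120));
        return chunks;
    }

    /** Joins the chunk texts, either in drawing order or sorted by position. */
    static String extract(List<Chunk> drawnOrder, boolean sortByPosition) {
        List<Chunk> chunks = new ArrayList<>(drawnOrder);
        if (sortByPosition) {
            // Simplified ordering: top to bottom, then left to right.
            chunks.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                    .thenComparingDouble(c -> c.x));
        }
        return chunks.stream().map(c -> c.text).collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        List<Chunk> form = flattenedForm();
        System.out.println(extract(form, false)); // Name: City: Alice Berlin
        System.out.println(extract(form, true));  // Name: Alice City: Berlin
    }
}
```

Without sorting, the values end up detached from their labels; with sorting,
each value follows the label drawn on the same line.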

> PDFTextStripper could use text size in reconstruction
> -----------------------------------------------------
>
>                 Key: PDFBOX-5411
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5411
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.25, 3.0.0 PDFBox
>            Reporter: Lapo Luchini
>            Priority: Minor
>         Attachments: image-2022-04-08-16-13-17-334.png, 
> image-2022-04-15-09-26-20-917.png, textDoubleText.pdf
>
>
> When two texts are partially overlapping {{PDFTextStripper}} seems to return 
> a mix simply based on "leftmost x coordinate of the glyph", which makes 
> sense, but it could make use of glyph size to disambiguate "easy" cases like 
> this one:
> !image-2022-04-08-16-13-17-334.png!
> currently this is the first parameter of PDFTextStripper.writeString(String 
> string, List<TextPosition> textPositions):
> {{"T0510E09620_S368b3aT92-29fa -4Leef-80I5e-N53c23efE7979f"}}
> I would of course hope for two calls:
> {{"TEST LINE"}}
> {{"051009620_368b3a92-29fa-4eef-805e-53c23ef7979f"}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

