[ 
https://issues.apache.org/jira/browse/PDFBOX-5411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522705#comment-17522705
 ] 

Lapo Luchini commented on PDFBOX-5411:
--------------------------------------

Yes, that makes sense. It is not an "easy" case, just an "IMHO somewhat 
solvable" one.

For example, I would prefer to have text separation for the first example and 
"nothing" (but single letters) for a case like this:

!image-2022-04-15-09-26-20-917.png!

What I mean is: this is not a change that would only bring good results. It 
would potentially break existing good ones, but maybe the improvement in the 
former cases outweighs the worsening in the latter.

And maybe a "text size difference of max ±20%" would be a good heuristic to 
have both (I wonder).
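
To make the ±20% idea concrete, here is a minimal sketch of how glyphs could be split into runs by size before being concatenated. It is not PDFBox code: {{Glyph}} is a hypothetical stand-in for {{TextPosition}}, {{splitBySize}} and {{similarSize}} are made-up names, and measuring the 20% against the larger of the two sizes is just one possible reading of the heuristic.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SizeAwareGrouping {

    // Hypothetical stand-in for a text position: only what the heuristic needs.
    static final class Glyph {
        final String text;
        final float x;
        final float fontSize;

        Glyph(String text, float x, float fontSize) {
            this.text = text;
            this.x = x;
            this.fontSize = fontSize;
        }
    }

    // Assumption: "within tolerance" is measured relative to the larger size.
    static boolean similarSize(float a, float b, float tolerance) {
        return Math.abs(a - b) <= tolerance * Math.max(a, b);
    }

    // Sorts glyphs by leftmost x, then appends each glyph to the first run
    // whose reference size (the size of its first glyph) is similar enough;
    // otherwise starts a new run.
    static List<String> splitBySize(List<Glyph> glyphs, float tolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble(g -> g.x));

        List<StringBuilder> runs = new ArrayList<>();
        List<Float> runSizes = new ArrayList<>();
        for (Glyph g : sorted) {
            int target = -1;
            for (int i = 0; i < runs.size(); i++) {
                if (similarSize(runSizes.get(i), g.fontSize, tolerance)) {
                    target = i;
                    break;
                }
            }
            if (target < 0) {
                runs.add(new StringBuilder());
                runSizes.add(g.fontSize);
                target = runs.size() - 1;
            }
            runs.get(target).append(g.text);
        }

        List<String> result = new ArrayList<>();
        for (StringBuilder run : runs) {
            result.add(run.toString());
        }
        return result;
    }
}
```

With a tolerance of 0.20f, interleaved glyphs of size 20 and size 10 end up in two separate runs, while glyphs whose sizes differ only slightly stay together, which is the "have both" behaviour I was wondering about.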

> PDFTextStripper could use text size in reconstruction
> -----------------------------------------------------
>
>                 Key: PDFBOX-5411
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5411
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.25, 3.0.0 PDFBox
>            Reporter: Lapo Luchini
>            Priority: Minor
>         Attachments: image-2022-04-08-16-13-17-334.png, 
> image-2022-04-15-09-26-20-917.png, textDoubleText.pdf
>
>
> When two texts are partially overlapping, {{PDFTextStripper}} seems to return 
> a mix simply based on "leftmost x coordinate of the glyph", which makes 
> sense, but it could make use of glyph size to disambiguate "easy" cases like 
> this one:
> !image-2022-04-08-16-13-17-334.png!
> Currently, this is the first parameter of PDFTextStripper.writeString(String 
> string, List<TextPosition> textPositions):
> {{"T0510E09620_S368b3aT92-29fa -4Leef-80I5e-N53c23efE7979f"}}
> I would of course hope for two calls:
> {{"TEST LINE"}}
> {{"051009620_368b3a92-29fa-4eef-805e-53c23ef7979f"}}


