Lapo Luchini created PDFBOX-5411:
------------------------------------

             Summary: PDFTextStripper could use text size in reconstruction
                 Key: PDFBOX-5411
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5411
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.25, 3.0.0 PDFBox
            Reporter: Lapo Luchini
         Attachments: image-2022-04-08-16-13-17-334.png

When two text runs partially overlap, {{PDFTextStripper}} seems to return a 
mix ordered simply by the leftmost x coordinate of each glyph. That makes sense 
in general, but it could use glyph size to disambiguate "easy" cases like this one:

!image-2022-04-08-16-13-17-334.png!

Currently, this is the first argument passed to 
{{PDFTextStripper.writeString(String string, List<TextPosition> textPositions)}}:

{{"T0510E09620_S368b3aT92-29fa -4Leef-80I5e-N53c23efE7979f"}}

I would of course hope for two calls:

{{"TEST LINE"}}
{{"051009620_368b3a92-29fa-4eef-805e-53c23ef7979f"}}
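
To illustrate the idea, here is a minimal standalone sketch, not a patch against PDFBox. It assumes each glyph on a line exposes its text, its leftmost x coordinate and its font size in points; in PDFBox these would presumably come from {{TextPosition.getUnicode()}}, {{getXDirAdj()}} and {{getFontSizeInPt()}}. The {{Glyph}} class, the {{splitBySize}} method and the 0.1 pt bucketing are hypothetical names and choices made only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SizeAwareGrouping {

    /** Hypothetical stand-in for the few TextPosition fields used below. */
    public static final class Glyph {
        public final String text;
        public final float x;          // leftmost x coordinate of the glyph
        public final float fontSizePt; // font size in points

        public Glyph(String text, float x, float fontSizePt) {
            this.text = text;
            this.x = x;
            this.fontSizePt = fontSizePt;
        }
    }

    /**
     * Splits the glyphs of one line into runs that share a font size
     * (bucketed to 0.1 pt), then orders each run by x coordinate.
     * Runs are returned in the order their first glyph was encountered.
     */
    public static List<String> splitBySize(List<Glyph> glyphs) {
        Map<Integer, List<Glyph>> bySize = new LinkedHashMap<>();
        for (Glyph g : glyphs) {
            int key = Math.round(g.fontSizePt * 10f);
            bySize.computeIfAbsent(key, k -> new ArrayList<>()).add(g);
        }
        List<String> runs = new ArrayList<>();
        for (List<Glyph> run : bySize.values()) {
            run.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            StringBuilder sb = new StringBuilder();
            for (Glyph g : run) {
                sb.append(g.text);
            }
            runs.add(sb.toString());
        }
        return runs;
    }

    public static void main(String[] args) {
        // Large "TEST" interleaved by x with small "0510", as in the overlap case.
        List<Glyph> mixed = Arrays.asList(
            new Glyph("T", 0f, 20f), new Glyph("0", 1f, 8f),
            new Glyph("E", 2f, 20f), new Glyph("5", 3f, 8f),
            new Glyph("S", 4f, 20f), new Glyph("1", 5f, 8f),
            new Glyph("T", 6f, 20f), new Glyph("0", 7f, 8f));
        System.out.println(splitBySize(mixed)); // prints [TEST, 0510]
    }
}
```

Grouping by size alone would of course also split a single line that legitimately mixes sizes, so a real implementation would probably apply it only when glyph bounding boxes actually overlap.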



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

