Lapo Luchini created PDFBOX-5411:
------------------------------------
Summary: PDFTextStripper could use text size in reconstruction
Key: PDFBOX-5411
URL: https://issues.apache.org/jira/browse/PDFBOX-5411
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 2.0.25, 3.0.0 PDFBox
Reporter: Lapo Luchini
Attachments: image-2022-04-08-16-13-17-334.png
When two pieces of text partially overlap, {{PDFTextStripper}} seems to return
a mix ordered simply by the leftmost x coordinate of each glyph. That makes
sense in general, but it could use glyph size to disambiguate "easy" cases like
this one:
!image-2022-04-08-16-13-17-334.png!
Currently, this is the first parameter passed to
PDFTextStripper.writeString(String string, List<TextPosition> textPositions):
{{"T0510E09620_S368b3aT92-29fa -4Leef-80I5e-N53c23efE7979f"}}
I would of course hope for two calls:
{{"TEST LINE"}}
{{"051009620_368b3a92-29fa-4eef-805e-53c23ef7979f"}}
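To illustrate the idea, here is a minimal sketch of size-aware grouping. It does not use PDFBox itself: {{Glyph}} is a hypothetical stand-in for {{TextPosition}} (in real code the values would presumably come from {{getUnicode()}}, {{getXDirAdj()}} and {{getFontSizeInPt()}}), and bucketing by rounded font size is only an assumed heuristic, not how {{PDFTextStripper}} works today.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for TextPosition: only the fields this sketch needs.
final class Glyph {
    final String text;
    final float x;          // leftmost x coordinate of the glyph
    final float fontSizePt; // font size in points

    Glyph(String text, float x, float fontSizePt) {
        this.text = text;
        this.x = x;
        this.fontSizePt = fontSizePt;
    }
}

final class SizeSplitSketch {

    // Reconstruction by leftmost x only: overlapping lines get interleaved.
    static String mergeByX(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble(g -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : sorted) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    // Size-aware variant: bucket glyphs by rounded font size (assumed heuristic),
    // keeping x order inside each bucket. Buckets appear in order of first glyph.
    static List<String> splitBySize(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble(g -> g.x));
        Map<Integer, StringBuilder> runs = new LinkedHashMap<>();
        for (Glyph g : sorted) {
            int key = Math.round(g.fontSizePt);
            runs.computeIfAbsent(key, k -> new StringBuilder()).append(g.text);
        }
        List<String> out = new ArrayList<>();
        for (StringBuilder sb : runs.values()) {
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // A large "TE" overlapping a small "051" on the same baseline.
        List<Glyph> line = List.of(
                new Glyph("T", 0f, 24f),
                new Glyph("E", 12f, 24f),
                new Glyph("0", 1f, 8f),
                new Glyph("5", 5f, 8f),
                new Glyph("1", 9f, 8f));
        System.out.println(mergeByX(line));    // T051E
        System.out.println(splitBySize(line)); // [TE, 051]
    }
}
```

With something along these lines, each bucket could be passed to its own {{writeString}} call. A real implementation would likely also need to check that the glyphs actually overlap before splitting, so that ordinary lines mixing font sizes are not broken apart.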