[ https://issues.apache.org/jira/browse/PDFBOX-5411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lapo Luchini updated PDFBOX-5411:
---------------------------------
Attachment: textDoubleText.pdf
> PDFTextStripper could use text size in reconstruction
> -----------------------------------------------------
>
> Key: PDFBOX-5411
> URL: https://issues.apache.org/jira/browse/PDFBOX-5411
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.25, 3.0.0 PDFBox
> Reporter: Lapo Luchini
> Priority: Minor
> Attachments: image-2022-04-08-16-13-17-334.png, textDoubleText.pdf
>
>
> When two texts partially overlap, {{PDFTextStripper}} seems to return
> a mix ordered simply by the leftmost x coordinate of each glyph. That makes
> sense in general, but it could use glyph size to disambiguate "easy" cases
> like this one:
> !image-2022-04-08-16-13-17-334.png!
> Currently the first parameter passed to
> {{PDFTextStripper.writeString(String string, List<TextPosition> textPositions)}} is:
> {{"T0510E09620_S368b3aT92-29fa -4Leef-80I5e-N53c23efE7979f"}}
> I would of course hope for two calls:
> {{"TEST LINE"}}
> {{"051009620_368b3a92-29fa-4eef-805e-53c23ef7979f"}}
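
The size-based disambiguation requested above could be sketched roughly as follows. This is only an illustration and not PDFBox code: the `Glyph` class, `joinByX`, `splitBySize`, and the sample coordinates are all hypothetical stand-ins, and the bucketing assumes the two runs differ by an exact font size.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SizeAwareGrouping {

    /** Hypothetical stand-in for a positioned glyph; this is not a PDFBox class. */
    static final class Glyph {
        final String text;
        final float x;
        final float fontSize;

        Glyph(String text, float x, float fontSize) {
            this.text = text;
            this.x = x;
            this.fontSize = fontSize;
        }
    }

    /** Concatenates glyphs ordered by x only, mimicking the mixing described in the report. */
    static String joinByX(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : sorted) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    /** Buckets glyphs by exact font size (first-seen order), then orders each bucket by x. */
    static List<String> splitBySize(List<Glyph> glyphs) {
        Map<Float, List<Glyph>> bySize = new LinkedHashMap<>();
        for (Glyph g : glyphs) {
            bySize.computeIfAbsent(g.fontSize, k -> new ArrayList<>()).add(g);
        }
        List<String> lines = new ArrayList<>();
        for (List<Glyph> bucket : bySize.values()) {
            lines.add(joinByX(bucket));
        }
        return lines;
    }

    /** Two overlapping runs (made-up coordinates): "TEST" at size 12 and "0510" at size 6. */
    static List<Glyph> sample() {
        List<Glyph> glyphs = new ArrayList<>();
        String big = "TEST";
        String small = "0510";
        for (int i = 0; i < big.length(); i++) {
            glyphs.add(new Glyph(String.valueOf(big.charAt(i)), i * 10f, 12f));
        }
        for (int i = 0; i < small.length(); i++) {
            glyphs.add(new Glyph(String.valueOf(small.charAt(i)), 2f + i * 5f, 6f));
        }
        return glyphs;
    }

    public static void main(String[] args) {
        System.out.println(joinByX(sample()));      // T05E10ST
        System.out.println(splitBySize(sample()));  // [TEST, 0510]
    }
}
```

In an actual {{PDFTextStripper}} subclass the same idea would presumably read the size and position from {{TextPosition}} (for example {{getFontSizeInPt()}} and {{getXDirAdj()}}). A real fix would likely also need a tolerance on the size comparison and a baseline check, which this sketch leaves out.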