[
https://issues.apache.org/jira/browse/PDFBOX-4481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr updated PDFBOX-4481:
------------------------------------
Labels: Thai (was: )
> Text extraction error with Thai combined glyph depending on space after it
> --------------------------------------------------------------------------
>
> Key: PDFBOX-4481
> URL: https://issues.apache.org/jira/browse/PDFBOX-4481
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.14
> Reporter: Tilman Hausherr
> Priority: Major
> Labels: Thai
> Attachments: SO54981236-reduced.pdf, SO54981236-reduced.txt,
> SO54981236.pdf
>
>
> In the first extracted line of the reduced file, the "accent" (somebody
> please correct me on what that thing is called) is extracted separately from
> its base glyph. On the second line it is in the proper place. Content stream:
> {code}
> BT
> 1 0 0 1 67.3 756.98 Tm
> [ (\000\203\000\227\000q) ] TJ
> 1 0 0 1 77.5 756.98 Tm
> [ (\000\003) ] TJ
> 1 0 0 1 67.3 730 Tm
> [ (\000\203\000\227\000q\000\003) ] TJ
> ET
> {code}
> The weird thing is that the "\003" is just a space. So when the space is in
> the same string as the glyphs, the extraction works, and when it is shown in
> a separate string, it doesn't.
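To make the comparison of the two lines concrete, here is a minimal sketch that unescapes the literal strings from the content stream above and groups the bytes into character codes. It assumes the font uses 2-byte codes (for example a Type0 font with an Identity-H encoding); the issue does not state this, it is only suggested by the \000 prefixes. The class and method names (TjStringCodes, unescape, decodeCodes) are hypothetical and not part of PDFBox.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not PDFBox code: decodes the body of a PDF literal
// string (the text between the parentheses) into 2-byte character codes.
class TjStringCodes {

    /** Resolves the backslash escapes of a PDF literal string into raw bytes. */
    static byte[] unescape(String literal) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < literal.length()) {
            char c = literal.charAt(i++);
            if (c != '\\' || i == literal.length()) {
                out.write(c); // ordinary character, e.g. 'q'
                continue;
            }
            char e = literal.charAt(i);
            if (e >= '0' && e <= '7') {
                // Octal escape: one to three octal digits, e.g. \203 or \003
                int value = 0;
                int digits = 0;
                while (digits < 3 && i < literal.length()
                        && literal.charAt(i) >= '0' && literal.charAt(i) <= '7') {
                    value = value * 8 + (literal.charAt(i) - '0');
                    i++;
                    digits++;
                }
                out.write(value & 0xFF);
            } else {
                i++;
                switch (e) {
                    case 'n': out.write('\n'); break;
                    case 'r': out.write('\r'); break;
                    case 't': out.write('\t'); break;
                    case 'b': out.write('\b'); break;
                    case 'f': out.write('\f'); break;
                    default:  out.write(e);    break; // \( \) \\ and unknown escapes
                }
            }
        }
        return out.toByteArray();
    }

    /** Pairs the bytes big-endian into 2-byte codes (ASSUMPTION: 2-byte font encoding). */
    static int[] decodeCodes(String literal) {
        byte[] bytes = unescape(literal);
        if (bytes.length % 2 != 0) {
            throw new IllegalArgumentException("odd number of bytes: " + bytes.length);
        }
        int[] codes = new int[bytes.length / 2];
        for (int k = 0; k < codes.length; k++) {
            codes[k] = ((bytes[2 * k] & 0xFF) << 8) | (bytes[2 * k + 1] & 0xFF);
        }
        return codes;
    }

    public static void main(String[] args) {
        // The three strings from the content stream; backslashes are doubled for Java.
        String[] strings = {
            "\\000\\203\\000\\227\\000q",
            "\\000\\003",
            "\\000\\203\\000\\227\\000q\\000\\003"
        };
        for (String s : strings) {
            StringBuilder sb = new StringBuilder();
            for (int code : decodeCodes(s)) {
                sb.append(String.format("%04X ", code));
            }
            System.out.println(sb.toString().trim());
        }
    }
}
```

Under that assumption, the first string decodes to the codes 0083 0097 0071, the second to 0003, and the third to 0083 0097 0071 0003. So both lines contain the same three glyph codes, and the only difference is whether the space code 0003 is part of the same string or shown separately after a new Tm.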