Tilman Hausherr created PDFBOX-4481:
---------------------------------------
Summary: Text extraction error with Thai combined glyph depending
on space after it
Key: PDFBOX-4481
URL: https://issues.apache.org/jira/browse/PDFBOX-4481
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 2.0.14
Reporter: Tilman Hausherr
Attachments: SO54981236-reduced.pdf, SO54981236-reduced.txt,
SO54981236.pdf
In the first extracted line of the reduced file, the "accent" (somebody please
correct me on what that mark is called) comes out separated from its base
character. On the second line it is in the proper place. Content stream:
{code}
BT
1 0 0 1 67.3 756.98 Tm
[ (\000\203\000\227\000q) ] TJ
1 0 0 1 77.5 756.98 Tm
[ (\000\003) ] TJ
1 0 0 1 67.3 730 Tm
[ (\000\203\000\227\000q\000\003) ] TJ
ET
{code}
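To make the layout of the stream explicit, here is a minimal Python sketch that groups the text-showing operations by baseline. It is a line-based toy reader written only for this reduced stream, not a PDF parser; the names `show_ops` and `group_by_baseline` are made up for illustration.

```python
import re

# The reduced content stream from this issue, copied verbatim.
CONTENT_STREAM = r"""BT
1 0 0 1 67.3 756.98 Tm
[ (\000\203\000\227\000q) ] TJ
1 0 0 1 77.5 756.98 Tm
[ (\000\003) ] TJ
1 0 0 1 67.3 730 Tm
[ (\000\203\000\227\000q\000\003) ] TJ
ET"""

# Assumption: every Tm here is a pure translation ("1 0 0 1 x y Tm")
# and every TJ array holds exactly one string, as in the stream above.
TM = re.compile(r"^\s*1 0 0 1 (\S+) (\S+) Tm\s*$")
TJ = re.compile(r"^\s*\[ \((.*)\) \] TJ\s*$")


def show_ops(stream):
    """Return a list of ((x, y), raw_string) for each TJ in the stream."""
    ops = []
    pos = None
    for line in stream.splitlines():
        m = TM.match(line)
        if m:
            pos = (float(m.group(1)), float(m.group(2)))
            continue
        m = TJ.match(line)
        if m and pos is not None:
            ops.append((pos, m.group(1)))
    return ops


def group_by_baseline(ops):
    """Group the TJ operations by their y coordinate."""
    lines = {}
    for (x, y), raw in ops:
        lines.setdefault(y, []).append((x, raw))
    return lines


for y, runs in group_by_baseline(show_ops(CONTENT_STREAM)).items():
    print(y, runs)
```

The baseline at y = 756.98 carries two separately positioned runs (at x = 67.3 and x = 77.5), while the baseline at y = 730 carries a single run.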
The weird thing is that the "\000\003" is just a space. So when the space is
in the same string as the other glyphs (second line), the extraction works,
and when it is shown separately after its own Tm (first line), it doesn't.
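As a sanity check, the string literals can be decoded into glyph codes. The sketch below is Python, handles only the escapes that matter here, and assumes the font uses two-byte codes (the issue does not say so; the `\000` prefixes suggest it). The helper names are made up for illustration.

```python
import re

# The three string literals from the content stream, copied verbatim.
LINE1_GLYPHS = r"\000\203\000\227\000q"
LINE1_SPACE = r"\000\003"
LINE2_ALL = r"\000\203\000\227\000q\000\003"

# Single-character escapes of PDF literal strings.
_SIMPLE_ESCAPES = {"n": 0x0A, "r": 0x0D, "t": 0x09, "b": 0x08,
                   "f": 0x0C, "(": 0x28, ")": 0x29, "\\": 0x5C}


def literal_to_bytes(literal):
    """Decode the body of a PDF literal string (outer parentheses removed).

    Line continuations (backslash followed by end-of-line) are not handled.
    """
    out = bytearray()
    i = 0
    while i < len(literal):
        ch = literal[i]
        if ch != "\\":
            out.append(ord(ch))
            i += 1
            continue
        octal = re.match(r"[0-7]{1,3}", literal[i + 1:])
        if octal:
            # Backslash followed by one to three octal digits.
            out.append(int(octal.group(), 8) & 0xFF)
            i += 1 + len(octal.group())
        elif i + 1 < len(literal) and literal[i + 1] in _SIMPLE_ESCAPES:
            out.append(_SIMPLE_ESCAPES[literal[i + 1]])
            i += 2
        else:
            # Unknown escape: drop the backslash, keep what follows.
            i += 1
    return bytes(out)


def two_byte_codes(literal):
    """Split the decoded bytes into big-endian two-byte codes (assumption)."""
    data = literal_to_bytes(literal)
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


first_line = two_byte_codes(LINE1_GLYPHS) + two_byte_codes(LINE1_SPACE)
second_line = two_byte_codes(LINE2_ALL)

print([f"{c:04X}" for c in second_line])  # ['0083', '0097', '0071', '0003']
print(first_line == second_line)          # True
```

Under that assumption both lines show the same four codes in the same order, so the only difference between them is whether the space shares a string with the preceding glyphs.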