Tilman Hausherr created PDFBOX-4481:
---------------------------------------

             Summary: Text extraction error with Thai combined glyph depending on space after it
                 Key: PDFBOX-4481
                 URL: https://issues.apache.org/jira/browse/PDFBOX-4481
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.14
            Reporter: Tilman Hausherr
         Attachments: SO54981236-reduced.pdf, SO54981236-reduced.txt, SO54981236.pdf

In the first extracted line of the reduced file, the "accent" (somebody please 
correct me on what that thing is called) is separated from the glyph it belongs 
to. On the second line it is in the proper place. Content stream:
{code}
BT
  1 0 0 1 67.3 756.98 Tm
  [ (\000\203\000\227\000q) ] TJ
  1 0 0 1 77.5 756.98 Tm
  [ (\000\003) ] TJ
  1 0 0 1 67.3 730 Tm
  [ (\000\203\000\227\000q\000\003) ] TJ
ET
{code}
The weird thing is that the "\003" is just a space. When the space is part of 
the same string as the glyphs (second line), the extraction works; when it is 
shown by a separate TJ at its own position (first line), it doesn't.
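
To make the comparison concrete, here is a minimal Python sketch (not PDFBox code) that decodes the literal strings from the content stream above and groups the bytes into character codes. It assumes the font uses two-byte codes, as the leading \000 bytes suggest but the report does not state; the helper names decode_pdf_literal and two_byte_codes are made up for this illustration.

```python
import re

# Single-character escapes allowed inside a PDF literal string.
SIMPLE_ESCAPES = {"n": 10, "r": 13, "t": 9, "b": 8, "f": 12,
                  "(": 40, ")": 41, "\\": 92}


def decode_pdf_literal(body):
    """Decode the text between the parentheses of a PDF literal string.

    Hypothetical helper: expects ASCII input and handles octal escapes,
    the single-character escapes, and backslash line continuations.
    """
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ord(ch))
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= len(body):
            break  # lone trailing backslash: nothing to decode
        octal = re.match(r"[0-7]{1,3}", body[i:])
        if octal:
            out.append(int(octal.group(), 8) & 0xFF)
            i += octal.end()
        elif body[i] in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[body[i]])
            i += 1
        elif body[i] in "\r\n":
            # Backslash before an end-of-line continues the string.
            i += 2 if body[i:i + 2] == "\r\n" else 1
        else:
            out.append(ord(body[i]))  # unknown escape: drop the backslash
            i += 1
    return bytes(out)


def two_byte_codes(data):
    """Group bytes into big-endian two-byte codes (assumed code width)."""
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# The three strings from the content stream, in order of appearance.
glyphs_only = r"\000\203\000\227\000q"
space_only = r"\000\003"
glyphs_and_space = r"\000\203\000\227\000q\000\003"

# First line: glyphs and space come from two separate TJ operators.
split_codes = (two_byte_codes(decode_pdf_literal(glyphs_only))
               + two_byte_codes(decode_pdf_literal(space_only)))
# Second line: the same codes come from a single TJ operator.
joined_codes = two_byte_codes(decode_pdf_literal(glyphs_and_space))

print([hex(c) for c in split_codes])
print([hex(c) for c in joined_codes])
print("same codes:", split_codes == joined_codes)
```

Under that assumption both lines carry the same sequence of codes, so the only difference between them is that the space is positioned by its own Tm and TJ on the first line.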



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
