[
https://issues.apache.org/jira/browse/PDFBOX-4481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790892#comment-16790892
]
Tilman Hausherr commented on PDFBOX-4481:
-----------------------------------------
The problem in the reduced file is that the space has the same start position
as the diacritic. I was able to fix the text extraction in the reduced file by
removing the "=" in {{if (tp2Xend <= thisXstart || tp2Xstart >= thisXend)}} in
TextPosition.java, but this causes regressions in the complete file. So I'll
have to create another reduced file and keep searching :(
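
To illustrate why the "=" matters, here is a minimal standalone sketch of the quoted condition. It is not the PDFBox source: the class and method names are hypothetical, the coordinates are made up (a zero-width diacritic is assumed), and the strict variant assumes the "=" is dropped from both comparisons, which the comment above does not specify.

```java
// Hypothetical sketch, not PDFBox code: only the boolean condition is taken
// from the comment above; everything else is an illustrative assumption.
public class OverlapSketch {

    // The quoted condition: glyphs whose edges merely touch count as "no X overlap".
    static boolean noXOverlapInclusive(double thisXstart, double thisXend,
                                       double tp2Xstart, double tp2Xend) {
        return tp2Xend <= thisXstart || tp2Xstart >= thisXend;
    }

    // Assumed variant with "=" removed from both comparisons:
    // touching edges are no longer rejected early.
    static boolean noXOverlapStrict(double thisXstart, double thisXend,
                                    double tp2Xstart, double tp2Xend) {
        return tp2Xend < thisXstart || tp2Xstart > thisXend;
    }

    public static void main(String[] args) {
        // Made-up numbers: a space starting at x = 77.5 with some width,
        // and a diacritic assumed to have zero width at the same start.
        double thisXstart = 77.5, thisXend = 80.0;
        double tp2Xstart = 77.5, tp2Xend = 77.5;

        System.out.println("inclusive: "
                + noXOverlapInclusive(thisXstart, thisXend, tp2Xstart, tp2Xend));
        System.out.println("strict:    "
                + noXOverlapStrict(thisXstart, thisXend, tp2Xstart, tp2Xend));
    }
}
```

With these assumed values the inclusive check reports no overlap while the strict one does not, so two glyphs sharing a start position are treated differently by the two variants. That would be consistent with the change helping the reduced file while altering results elsewhere.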
> Text extraction error with Thai combined glyph depending on space after it
> --------------------------------------------------------------------------
>
> Key: PDFBOX-4481
> URL: https://issues.apache.org/jira/browse/PDFBOX-4481
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.14
> Reporter: Tilman Hausherr
> Priority: Major
> Labels: Thai
> Attachments: SO54981236-reduced.pdf, SO54981236-reduced.txt,
> SO54981236.pdf
>
>
> In the first extracted line of the reduced file, the "accent" (somebody
> please correct me on what that thing is called) is separate. On the second
> line it is at the proper place. Content stream:
> {code}
> BT
> 1 0 0 1 67.3 756.98 Tm
> [ (\000\203\000\227\000q) ] TJ
> 1 0 0 1 77.5 756.98 Tm
> [ (\000\003) ] TJ
> 1 0 0 1 67.3 730 Tm
> [ (\000\203\000\227\000q\000\003) ] TJ
> ET
> {code}
> The weird thing is that the "\003" is just a space. So when the space is in
> the string the extraction works, and when it isn't, it doesn't.