[jira] [Commented] (PDFBOX-4481) Text extraction error with Thai combined glyph depending on space after it

Tilman Hausherr (JIRA) Tue, 12 Mar 2019 12:13:18 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790892#comment-16790892
 ]


Tilman Hausherr commented on PDFBOX-4481:
-----------------------------------------

The problem in the reduced file is that the space is at the same start position 
as the diacritic. I was able to fix the text extraction in the reduced file by 
removing the "=" in {{if (tp2Xend <= thisXstart || tp2Xstart >= thisXend)}} in 
TextPosition.java, but this causes regressions in the complete file. So I'll 
have to create another reduced file and keep searching :(
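
To illustrate why the "=" matters, here is a minimal standalone sketch (not the PDFBox code itself) of the two variants of the quoted condition. The start position 77.5 is taken from the content stream below; the zero advance width of the diacritic, the space width of 2.5, and the class and method names are assumptions made purely for illustration.

```java
// Standalone sketch, not PDFBox code. Names and widths are hypothetical.
class OverlapSketch {
    // Mirrors the quoted condition: ranges that merely touch count as disjoint.
    static boolean disjointInclusive(float thisXstart, float thisXend,
                                     float tp2Xstart, float tp2Xend) {
        return tp2Xend <= thisXstart || tp2Xstart >= thisXend;
    }

    // Variant with the "=" removed: ranges that touch are no longer disjoint.
    static boolean disjointStrict(float thisXstart, float thisXend,
                                  float tp2Xstart, float tp2Xend) {
        return tp2Xend < thisXstart || tp2Xstart > thisXend;
    }

    public static void main(String[] args) {
        // Assumed: a zero-width diacritic at x = 77.5 and a space starting at
        // the same x with a hypothetical width of 2.5.
        float diacriticStart = 77.5f, diacriticEnd = 77.5f;
        float spaceStart = 77.5f, spaceEnd = 80.0f;

        System.out.println("inclusive: disjoint = "
                + disjointInclusive(diacriticStart, diacriticEnd, spaceStart, spaceEnd));
        System.out.println("strict:    disjoint = "
                + disjointStrict(diacriticStart, diacriticEnd, spaceStart, spaceEnd));
    }
}
```

Under these assumptions the two variants disagree only when one range starts exactly where the other starts or ends, which is the situation in the reduced file. That would also be consistent with the regressions in the complete file, where adjacent glyphs that merely touch would be treated differently.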

> Text extraction error with Thai combined glyph depending on space after it
> --------------------------------------------------------------------------
>
>                 Key: PDFBOX-4481
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4481
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.14
>            Reporter: Tilman Hausherr
>            Priority: Major
>              Labels: Thai
>         Attachments: SO54981236-reduced.pdf, SO54981236-reduced.txt, 
> SO54981236.pdf
>
>
> In the first extracted line of the reduced file, the "accent" (somebody 
> please correct me on what that thing is called) is separated from its base 
> character. On the second line it is at the proper place. Content stream:
> {code}
> BT
>   1 0 0 1 67.3 756.98 Tm
>   [ (\000\203\000\227\000q) ] TJ
>   1 0 0 1 77.5 756.98 Tm
>   [ (\000\003) ] TJ
>   1 0 0 1 67.3 730 Tm
>   [ (\000\203\000\227\000q\000\003) ] TJ
> ET
> {code}
> The weird thing is that the "\003" is just a space. So when the space is in 
> the same string as the glyphs, the extraction works, and when it is shown 
> separately, it doesn't.
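
For readers unfamiliar with the octal escapes in the content stream above, here is a small sketch that decodes the two literal strings into character codes. It assumes the font uses 2-byte codes, which the leading \000 bytes suggest but the issue does not state. The class and method names are made up for this example, and only plain characters and octal escapes are handled.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Illustrative helper, not part of PDFBox.
class PdfStringCodes {
    // Decodes plain characters and \ddd octal escapes (1 to 3 digits).
    // Other PDF escapes such as \n or \( are deliberately not handled.
    static byte[] decodeLiteral(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i++);
            if (c == '\\') {
                int value = 0;
                int digits = 0;
                while (digits < 3 && i < s.length()
                        && s.charAt(i) >= '0' && s.charAt(i) <= '7') {
                    value = value * 8 + (s.charAt(i++) - '0');
                    digits++;
                }
                out.write(value & 0xFF);
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    // Assumption: every code is two bytes, high byte first.
    static List<Integer> toTwoByteCodes(byte[] bytes) {
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i + 1 < bytes.length; i += 2) {
            codes.add(((bytes[i] & 0xFF) << 8) | (bytes[i + 1] & 0xFF));
        }
        return codes;
    }

    public static void main(String[] args) {
        String[] literals = { "\\000\\203\\000\\227\\000q", "\\000\\003" };
        for (String literal : literals) {
            StringBuilder sb = new StringBuilder();
            for (int code : toTwoByteCodes(decodeLiteral(literal))) {
                sb.append(String.format("0x%04X ", code));
            }
            System.out.println(literal + " -> " + sb.toString().trim());
        }
    }
}
```

Under the 2-byte assumption, the first string carries three codes and the second a single code 0x0003, the space. The third string in the content stream is simply those four codes in one TJ operand.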



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

