[ 
https://issues.apache.org/jira/browse/PDFBOX-4834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108066#comment-17108066
 ] 

Michael Klink commented on PDFBOX-4834:
---------------------------------------

For example, in line 1, where "है।" is extracted as "ह।ै", the diacritical mark 
is not merged because PDFBox expects two baseline segments to overlap: the one 
from glyph origin to glyph origin plus glyph width of the base character glyph, 
and the same segment of the diacritical mark glyph. In this case they don't 
overlap, because the mark is actually drawn to the left of its glyph origin, 
while that origin lies slightly to the right of the end of the base character's 
segment:

{noformat}
[ (\)) -12.6 (+) 12.7 (,) -0.1 ] TJ
{noformat}

')' maps to "ह", '+' maps to the diacritical mark, and ',' maps to "।". The 
mark has a zero width.

Thus, "ह" is drawn; then, after a _right_ shift by 12.6, the diacritical is 
drawn. The baseline segments do not overlap, so PDFBox does not merge the 
diacritical, even though the actual drawings do overlap because the diacritical 
is drawn *left* of its origin. Then, after a _left_ shift by 12.7, "।" is 
drawn, with its origin even slightly before the origin of the mark glyph, 
resulting in "ह।ै".
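
The arithmetic above can be sketched in a few lines of Java. This is only an 
illustration, not PDFBox's actual code: the class and method names are 
hypothetical, the glyph widths 500 and 250 are assumed values (the real widths 
are not given here), and font size, character spacing and horizontal scaling 
are ignored. All numbers are in thousandths of a text space unit, and a TJ 
number is subtracted from the current position, so a negative value shifts to 
the right.

```java
import java.util.ArrayList;
import java.util.List;

public class BaselineOverlapSketch {

    /** Baseline segment of one glyph: from its origin to origin plus width. */
    static final class Segment {
        final String text;
        final double start;
        final double end;

        Segment(String text, double start, double end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Lays out the glyphs of one TJ array along the baseline.
     * adjustments[i] is the TJ number that follows glyph i.
     */
    static List<Segment> layout(String[] glyphs, double[] widths, double[] adjustments) {
        List<Segment> segments = new ArrayList<>();
        double x = 0;
        for (int i = 0; i < glyphs.length; i++) {
            segments.add(new Segment(glyphs[i], x, x + widths[i]));
            // Advance by the glyph width, then subtract the TJ adjustment.
            x += widths[i] - adjustments[i];
        }
        return segments;
    }

    /** Simplified stand-in for the overlap requirement described above. */
    static boolean baselinesOverlap(Segment a, Segment b) {
        return a.start <= b.end && b.start <= a.end;
    }

    public static void main(String[] args) {
        // [ (\)) -12.6 (+) 12.7 (,) -0.1 ] TJ
        String[] glyphs = { "base", "mark", "danda" };
        double[] widths = { 500, 0, 250 };  // 500 and 250 are assumed; the mark has zero width
        double[] adjustments = { -12.6, 12.7, -0.1 };

        List<Segment> s = layout(glyphs, widths, adjustments);
        Segment base = s.get(0);
        Segment mark = s.get(1);
        Segment danda = s.get(2);

        System.out.println("base/mark baselines overlap: " + baselinesOverlap(base, mark));
        System.out.println("danda origin before mark origin: " + (danda.start < mark.start));
    }
}
```

With these assumed widths, the mark's origin lands 12.6 units past the end of 
the base segment, so the check reports no overlap, and the danda's origin ends 
up 12.7 units before the mark's origin.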

Unless PDFBox starts to take the actual glyph drawing into consideration or 
stops requiring the baseline overlap (which had been introduced to correctly 
recognize diacriticals in other documents), it won't be able to merge 
diacriticals like the one in this example from your document.

> Wrong read characters for Hindi conjuncts
> -----------------------------------------
>
>                 Key: PDFBOX-4834
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4834
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.19
>         Environment: Windows 10, Java 9.
>            Reporter: Hesham
>            Priority: Minor
>         Attachments: PDFBOX-4834-Hindi.pdf
>
>
> When reading this Hindi PDF book using PDFBox 2.0.19:
> [https://dl.dropboxusercontent.com/s/laixlb5omvjqr7y/Hindi%20Book.pdf?dl=0]
>  
> It reads it with some wrong characters for conjuncts as it appears in this 
> file:
> [https://dl.dropboxusercontent.com/s/efyxz2eg37gvn4c/Text%20read%20by%20PDFBox%202.0.19.txt?dl=0]


