[
https://issues.apache.org/jira/browse/PDFBOX-4834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108066#comment-17108066
]
Michael Klink commented on PDFBOX-4834:
---------------------------------------
For example, in line 1, where "है।" is extracted as "ह।ै", the diacritical mark
is not merged because PDFBox requires the baseline segments of the base
character glyph and the diacritical mark glyph to overlap. Each segment runs
from the glyph origin to the glyph origin plus the glyph width. Here they do
not overlap: the mark is actually drawn to the left of its glyph origin, and
that origin lies slightly to the right of the end of the base character's
segment:
{noformat}
[ (\)) -12.6 (+) 12.7 (,) -0.1 ] TJ
{noformat}
')' maps to "ह", '+' maps to the diacritical mark, and ',' maps to "।". The
mark has a zero width.
Thus, "ह" is drawn; then, after a _right_ shift by 12.6, the diacritical is
drawn. There is no baseline overlap, so PDFBox does not merge the diacritical,
even though the actual drawings do overlap because the diacritical is drawn
*left* of its origin. Then, after a _left_ shift by 12.7, "।" is drawn with its
origin slightly before the origin of the mark glyph, resulting in "ह।ै".
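The arithmetic above can be replayed in a small, self-contained Java sketch.
It does not use or reproduce PDFBox code: the class and method names are
hypothetical, the overlap test is a simplified one-dimensional check, and the
glyph widths of 600 and 300 units are assumed values chosen only for
illustration (the issue states only that the mark has zero width).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class BaselineOverlapSketch {

    /** Baseline segment of one glyph: origin to origin plus advance width. */
    public static final class Span {
        public final String text;
        public final double start;
        public final double end;

        public Span(String text, double start, double end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Strict 1-D overlap test; segments that merely touch do not overlap. */
    public static boolean overlaps(Span a, Span b) {
        return a.start < b.end && b.start < a.end;
    }

    /** Replays the TJ array from the comment, in glyph-space units. */
    public static List<Span> demoLayout() {
        // ASSUMED widths; only the zero width of the mark comes from the issue.
        double baseWidth = 600, markWidth = 0, dandaWidth = 300;
        List<Span> spans = new ArrayList<>();
        double x = 0;
        x = show(spans, "\u0939", x, baseWidth);   // ')' maps to the base letter
        x -= -12.6;                                 // negative TJ number: shift right
        x = show(spans, "\u0948", x, markWidth);   // '+' maps to the diacritical
        x -= 12.7;                                  // positive TJ number: shift left
        show(spans, "\u0964", x, dandaWidth);      // ',' maps to the danda
        return spans;
    }

    /** Records the glyph's segment and returns the position after its advance. */
    private static double show(List<Span> spans, String text, double x, double width) {
        spans.add(new Span(text, x, x + width));
        return x + width;
    }

    /** Concatenates the glyph texts sorted by origin, ignoring drawing order. */
    public static String inOriginOrder(List<Span> spans) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingDouble(s -> s.start));
        StringBuilder sb = new StringBuilder();
        for (Span s : sorted) {
            sb.append(s.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Span> s = demoLayout();
        System.out.println("base/mark overlap: " + overlaps(s.get(0), s.get(1)));
        System.out.println("origin order: " + inOriginOrder(s));
    }
}
```

Under these assumed widths the mark's origin lands 12.6 units past the end of
the base segment, so the overlap test fails, and sorting by origin places "।"
before the mark.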
Unless PDFBox starts to take the actual glyph drawing into consideration or
stops requiring the baseline overlap (which was introduced to correctly
recognize diacriticals in other documents), it won't be able to merge
diacriticals like the one in this example from your document.
> Wrong read characters for Hindi conjuncts
> -----------------------------------------
>
> Key: PDFBOX-4834
> URL: https://issues.apache.org/jira/browse/PDFBOX-4834
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.19
> Environment: Windows 10, Java 9.
> Reporter: Hesham
> Priority: Minor
> Attachments: PDFBOX-4834-Hindi.pdf
>
>
> When reading this Hindi PDF book using PDFBox 2.0.19:
> [https://dl.dropboxusercontent.com/s/laixlb5omvjqr7y/Hindi%20Book.pdf?dl=0]
>
> It extracts some conjuncts with wrong characters, as shown in this
> file:
> [https://dl.dropboxusercontent.com/s/efyxz2eg37gvn4c/Text%20read%20by%20PDFBox%202.0.19.txt?dl=0]