[
https://issues.apache.org/jira/browse/PDFBOX-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Klink updated PDFBOX-4236:
----------------------------------
Priority: Minor (was: Major)
> PDFTextStripper diacritic merge sometimes chooses wrong base glyph
> ------------------------------------------------------------------
>
> Key: PDFBOX-4236
> URL: https://issues.apache.org/jira/browse/PDFBOX-4236
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 3.0.0 PDFBox
> Reporter: Michael Klink
> Priority: Minor
> Attachments: SA-U-NA.png, pattern3.pdf
>
>
> While answering [this Stack Overflow
> question|https://stackoverflow.com/q/50664162/1729265], I saw that text
> extraction from the example file [^pattern3.pdf] exposes an error in the
> diacritic merging code: the wrong base glyph is chosen.
> From the bottom of [my answer|https://stackoverflow.com/a/50679508/1729265]
> there:
> {quote}By the way, your test file exposes an error in the PDFBox
> determination of the base glyph to merge a diacritic with: The
> "स[1434]ु[1441]न[1418]" is meant to be rendered as "सुन", i.e. the vowel sign
> u "ु" is combined with the letter sa "स", but PDFBox combines it with the
> subsequent letter na "न" as "सनु".
> The cause is that it determines the letter to combine the diacritic with by
> its origin which here indeed is in the range of the latter letter na "न", but
> as the vowel sign glyph is rendered before its origin (it is drawn in an area
> with a negative x coordinate), PDFBox determines the wrong association:
> !SA-U-NA.png!
> {quote}
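
The difference between the two association strategies described above can be illustrated with a small, self-contained sketch. It does not use or reproduce PDFBox code: the Glyph class, both method names, and all coordinates are hypothetical and chosen only to mirror the situation in the quote, where the vowel sign's origin lies in the range of the letter na while its drawn area lies over the letter sa.

```java
import java.util.Arrays;
import java.util.List;

class DiacriticAssociationSketch {

    /** Hypothetical glyph model for illustration; this is not a PDFBox class. */
    static final class Glyph {
        final String text;
        final double originX;   // x coordinate of the glyph origin
        final double drawnMinX; // left edge of the area actually painted
        final double drawnMaxX; // right edge of the area actually painted

        Glyph(String text, double originX, double drawnMinX, double drawnMaxX) {
            this.text = text;
            this.originX = originX;
            this.drawnMinX = drawnMinX;
            this.drawnMaxX = drawnMaxX;
        }
    }

    /** Index of the first base glyph whose horizontal range contains the diacritic's origin, or -1. */
    static int baseByOrigin(List<Glyph> bases, Glyph diacritic) {
        for (int i = 0; i < bases.size(); i++) {
            Glyph b = bases.get(i);
            if (diacritic.originX >= b.drawnMinX && diacritic.originX <= b.drawnMaxX) {
                return i;
            }
        }
        return -1;
    }

    /** Index of the base glyph with the largest horizontal overlap with the diacritic's drawn area, or -1. */
    static int baseByDrawnExtent(List<Glyph> bases, Glyph diacritic) {
        int best = -1;
        double bestOverlap = 0;
        for (int i = 0; i < bases.size(); i++) {
            Glyph b = bases.get(i);
            double overlap = Math.min(b.drawnMaxX, diacritic.drawnMaxX)
                    - Math.max(b.drawnMinX, diacritic.drawnMinX);
            if (overlap > bestOverlap) {
                bestOverlap = overlap;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up coordinates: letter sa occupies [0, 10], letter na occupies [10, 20].
        Glyph sa = new Glyph("\u0938", 0, 0, 10);
        Glyph na = new Glyph("\u0928", 10, 10, 20);
        // Vowel sign u: origin at 12 (inside na), but painted left of its origin, over sa.
        Glyph signU = new Glyph("\u0941", 12, 4, 9);

        List<Glyph> bases = Arrays.asList(sa, na);
        System.out.println("by origin: index " + baseByOrigin(bases, signU));
        System.out.println("by drawn extent: index " + baseByDrawnExtent(bases, signU));
    }
}
```

With these assumed coordinates, the origin-based lookup selects index 1 (na) and the extent-based lookup selects index 0 (sa). Whether an extent-based test is the right fix for PDFTextStripper is not established here; the sketch only shows why an origin-based test can pick the wrong neighbour when a glyph is drawn at negative x relative to its origin.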