[ 
https://issues.apache.org/jira/browse/PDFBOX-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Klink updated PDFBOX-4236:
----------------------------------
    Priority: Minor  (was: Major)

> PDFTextStripper diacritic merge sometimes chooses wrong base glyph
> ------------------------------------------------------------------
>
>                 Key: PDFBOX-4236
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4236
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 3.0.0 PDFBox
>            Reporter: Michael Klink
>            Priority: Minor
>         Attachments: SA-U-NA.png, pattern3.pdf
>
>
> In the course of answering [this Stack Overflow 
> question|https://stackoverflow.com/q/50664162/1729265] I saw that text 
> extraction from the example file [^pattern3.pdf] exposes an error in the 
> diacritic merging code: the wrong base glyph is chosen.
> From the bottom of [my answer|https://stackoverflow.com/a/50679508/1729265] 
> there:
> {quote}By the way, your test file exposes an error in the PDFBox 
> determination of the base glyph to merge a diacritic with: The 
> "स[1434]ु[1441]न[1418]" is meant to be rendered as "सुन", i.e. the vowel sign 
> u "ु" is combined with the letter sa "स", but PDFBox combines it with the 
> subsequent letter na "न" as "सनु".
> The cause is that it determines the letter to combine the diacritic with by 
> its origin which here indeed is in the range of the latter letter na "न", but 
> as the vowel sign glyph is rendered before its origin (it is drawn in an area 
> with a negative x coordinate), PDFBox determines the wrong association:
> !SA-U-NA.png! 
> {quote}
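
The difference between the two association rules described in the quote can be illustrated with a minimal, self-contained Java sketch. It is not PDFBox code: the `Glyph` class, the method names `baseByOrigin` and `baseByInkOverlap`, and all coordinates are hypothetical and chosen only to mimic the reported situation, where the vowel sign's origin lies within the range of "न" while its outline is drawn to the left, under "स". Matching by the drawn extent is shown as one possible alternative, not as the fix adopted by the project.

```java
import java.util.Arrays;
import java.util.List;

public class DiacriticAssociation {

    /** Hypothetical glyph model (not a PDFBox class); all values share one horizontal unit. */
    public static final class Glyph {
        public final String text;
        public final float originX; // x coordinate of the glyph origin
        public final float inkMinX; // left edge of the area actually drawn
        public final float inkMaxX; // right edge of the area actually drawn

        public Glyph(String text, float originX, float inkMinX, float inkMaxX) {
            this.text = text;
            this.originX = originX;
            this.inkMinX = inkMinX;
            this.inkMaxX = inkMaxX;
        }
    }

    /** Index of the first base glyph whose drawn range contains the diacritic's origin, or -1. */
    public static int baseByOrigin(List<Glyph> bases, Glyph diacritic) {
        for (int i = 0; i < bases.size(); i++) {
            Glyph b = bases.get(i);
            if (diacritic.originX >= b.inkMinX && diacritic.originX <= b.inkMaxX) {
                return i;
            }
        }
        return -1;
    }

    /** Index of the base glyph with the largest horizontal overlap with the diacritic's drawn range, or -1. */
    public static int baseByInkOverlap(List<Glyph> bases, Glyph diacritic) {
        int best = -1;
        float bestOverlap = 0f;
        for (int i = 0; i < bases.size(); i++) {
            Glyph b = bases.get(i);
            float overlap = Math.min(b.inkMaxX, diacritic.inkMaxX)
                    - Math.max(b.inkMinX, diacritic.inkMinX);
            if (overlap > bestOverlap) {
                bestOverlap = overlap;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up coordinates: sa occupies 0..10, na occupies 10..20.
        Glyph sa = new Glyph("\u0938", 0f, 0f, 10f);   // DEVANAGARI LETTER SA
        Glyph na = new Glyph("\u0928", 10f, 10f, 20f); // DEVANAGARI LETTER NA
        // The vowel sign's origin (10.5) is inside na, but it is drawn at 3..8, under sa.
        Glyph u = new Glyph("\u0941", 10.5f, 3f, 8f);  // DEVANAGARI VOWEL SIGN U

        List<Glyph> bases = Arrays.asList(sa, na);
        System.out.println("by origin:      index " + baseByOrigin(bases, u));     // index 1 (na)
        System.out.println("by ink overlap: index " + baseByInkOverlap(bases, u)); // index 0 (sa)
    }
}
```

With these assumed coordinates the origin-based rule selects "न", reproducing the wrong merge, while the overlap-based rule selects "स". A real implementation would need the glyph's actual bounding box from the font, which this sketch simply takes as given.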



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
