[ 
https://issues.apache.org/jira/browse/PDFBOX-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Klink updated PDFBOX-4236:
----------------------------------
    Description: 
In the course of answering [this Stack Overflow 
question|https://stackoverflow.com/q/50664162/1729265] I saw that text 
extraction from the example file [^pattern3.pdf] exposes an error in the 
diacritic merging code: the wrong base glyph is chosen.

From the bottom of [my answer|https://stackoverflow.com/a/50679508/1729265] 
there:

{quote}By the way, your test file exposes an error in the PDFBox determination 
of the base glyph to merge a diacritic with: The "स[1434]ु[1441]न[1418]" is 
meant to be rendered as "सुन", i.e. the vowel sign u "ु" is combined with the 
letter sa "स", but PDFBox combines it with the subsequent letter na "न" as 
"सनु".

The cause is that it determines the letter to combine the diacritic with by its 
origin which here indeed is in the range of the latter letter na "न", but as 
the vowel sign glyph is rendered before its origin (it is drawn in an area with 
a negative x coordinate), PDFBox determines the wrong association:

!SA-U-NA.png! 
{quote}
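
To make the mismatch concrete, here is a minimal sketch of the two ways of picking the base glyph. It is not the PDFBox code: the class {{DiacriticBaseSketch}}, the {{Glyph}} model, both method names and all coordinates are made up for illustration, and the overlap rule is only one conceivable alternative, not a proposed patch. It assumes each glyph is described by its origin and by the x range its outline actually covers.

```java
import java.util.Arrays;
import java.util.List;

final class DiacriticBaseSketch {

    /** Hypothetical glyph model; none of this is PDFBox API. */
    static final class Glyph {
        final String name;
        final double originX;   // x of the glyph origin on the text line
        final double inkStartX; // left edge of the drawn outline
        final double inkEndX;   // right edge of the drawn outline

        Glyph(String name, double originX, double inkStartX, double inkEndX) {
            this.name = name;
            this.originX = originX;
            this.inkStartX = inkStartX;
            this.inkEndX = inkEndX;
        }
    }

    /** Origin rule: index of the base whose range [start, end) contains the mark's origin, else -1. */
    static int baseByOrigin(List<Glyph> bases, Glyph mark) {
        for (int i = 0; i < bases.size(); i++) {
            Glyph b = bases.get(i);
            if (mark.originX >= b.inkStartX && mark.originX < b.inkEndX) {
                return i;
            }
        }
        return -1;
    }

    /** Overlap rule: index of the base sharing the widest x range with the mark's outline, else -1. */
    static int baseByInkOverlap(List<Glyph> bases, Glyph mark) {
        int best = -1;
        double bestOverlap = 0;
        for (int i = 0; i < bases.size(); i++) {
            Glyph b = bases.get(i);
            double overlap = Math.min(b.inkEndX, mark.inkEndX)
                           - Math.max(b.inkStartX, mark.inkStartX);
            if (overlap > bestOverlap) {
                best = i;
                bestOverlap = overlap;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up coordinates: sa covers 0..10, na covers 10..18.
        List<Glyph> bases = Arrays.asList(
                new Glyph("sa", 0, 0, 10),
                new Glyph("na", 10, 10, 18));
        // Vowel sign u: origin at x=10, but its outline is drawn left of the origin, under sa.
        Glyph u = new Glyph("u", 10, 4, 9);

        System.out.println("origin rule picks: " + bases.get(baseByOrigin(bases, u)).name);          // na
        System.out.println("ink overlap picks: " + bases.get(baseByInkOverlap(bases, u)).name);      // sa
    }
}
```

With these numbers the origin rule selects na, which corresponds to the wrong "सनु" order, while the overlap rule selects sa.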


  was:
In the course of answering [this stack overflow 
question|https://stackoverflow.com/q/50664162/1729265] I saw that text 
extraction from the example file pattern3.pdf exposes an error in the diacritic 
merging code, the wrong base glyph is chosen.

From the bottom of [my answer|https://stackoverflow.com/a/50679508/1729265] 
there:

{quote}By the way, your test file exposes an error in the PDFBox determination 
of the base glyph to merge a diacritic with: The "स[1434]ु[1441]न[1418]" is 
meant to be rendered as "सुन", i.e. the vowel sign u "ु" is combined with the 
letter sa "स", but PDFBox combines it with the subsequent letter na "न" as 
"सनु".

The cause is that it determines the letter to combine the diacritic with by its 
origin which here indeed is in the range of the latter letter na "न", but as 
the vowel sign glyph is rendered before its origin (it is drawn in an area with 
a negative x coordinate), PDFBox determines the wrong association.
{quote}

Also see SA-U-NA.png, screen shots of the glyph coordinate ranges.


> PDFTextStripper diacritic merge sometimes chooses wrong base glyph
> ------------------------------------------------------------------
>
>                 Key: PDFBOX-4236
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4236
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 3.0.0 PDFBox
>            Reporter: Michael Klink
>            Priority: Major
>         Attachments: SA-U-NA.png, pattern3.pdf


