[ 
https://issues.apache.org/jira/browse/PDFBOX-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16827647#comment-16827647
 ] 

Tilman Hausherr edited comment on PDFBOX-4189 at 4/27/19 4:01 PM:
------------------------------------------------------------------

It has been a year, so I wanted to look at what's going on, and I concentrated 
on the first word ( আমি ).

example2 has an incorrect visual glyph sequence but correct text extraction.

example3 has a correct visual glyph sequence but incorrect text extraction.

The "scythe" ি  (= "BENGALI VOWEL SIGN I") is painted to the left of the 
consonant it is "influencing", but when typed in an editor, it comes after 
that consonant.
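
To make the ordering difference concrete, here is a minimal Java sketch. It is 
not PDFBox code: the class and method names are hypothetical, and it assumes 
the vowel sign only has to jump over the single character before it. Real 
shaping (for example with conjuncts) is more involved.

```java
class PreBaseVowelSketch {
    // U+09BF BENGALI VOWEL SIGN I, the "scythe"
    static final char VOWEL_SIGN_I = '\u09BF';

    /**
     * Simplified reordering (an assumption for illustration): each VOWEL SIGN I
     * is moved in front of the one character that precedes it in typed order.
     */
    static String toVisualOrder(String logical) {
        StringBuilder visual = new StringBuilder(logical.length());
        for (int i = 0; i < logical.length(); i++) {
            char c = logical.charAt(i);
            if (c == VOWEL_SIGN_I && visual.length() > 0) {
                // Paint the sign to the left of the character it follows.
                visual.insert(visual.length() - 1, c);
            } else {
                visual.append(c);
            }
        }
        return visual.toString();
    }

    public static void main(String[] args) {
        // Typed order of the first word: LETTER AA, LETTER MA, VOWEL SIGN I
        String logical = "\u0986\u09AE\u09BF";
        // Painted order becomes: LETTER AA, VOWEL SIGN I, LETTER MA
        for (char c : toVisualOrder(logical).toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

Text extraction has to undo this step, which is where the example PDFs differ.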

Word solves this by mapping the "scythe" glyph to the consonant in the 
ToUnicode table: [^bengali-word-lohit-good.pdf] 

This looked somewhat suspicious, so I wondered what would happen if I used the 
"scythe" glyph with two different consonants: ( আিমি ). The result was 
[^bengali-word-lohit-bad.pdf]: the glyphs look good, but the text extraction 
is wrong 🤣. That is really funny, but the downside is that for now we have 
no "gold standard" to look up to.
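
The failure can be modelled with a small Java sketch. Everything in it is 
hypothetical: the glyph IDs are made up, the second consonant (KA) is chosen 
only for illustration, and mapping the MA glyph to the vowel sign is an 
assumption about how the swap is completed. Only the "scythe glyph maps to the 
consonant" part comes from the observation above. It needs Java 9 or later for 
Map.of.

```java
import java.util.Map;

class ToUnicodeSwapSketch {
    // Hypothetical glyph IDs; real IDs depend on the embedded font subset.
    static final int GID_AA = 1, GID_SCYTHE = 2, GID_MA = 3, GID_KA = 4;

    // One fixed entry per glyph ID, like a ToUnicode table.
    static final Map<Integer, String> TO_UNICODE = Map.of(
            GID_AA, "\u0986",      // LETTER AA
            GID_SCYTHE, "\u09AE",  // scythe glyph mapped to the consonant MA
            GID_MA, "\u09BF",      // assumption: MA glyph mapped to VOWEL SIGN I
            GID_KA, "\u0995");     // LETTER KA, mapped normally

    /** Extracts text by looking up each glyph in paint order. */
    static String extract(int... glyphsInPaintOrder) {
        StringBuilder text = new StringBuilder();
        for (int gid : glyphsInPaintOrder) {
            text.append(TO_UNICODE.get(gid));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Painted AA, scythe, MA: extraction yields AA, MA, VOWEL SIGN I (typed order).
        String good = extract(GID_AA, GID_SCYTHE, GID_MA);
        System.out.println(good.equals("\u0986\u09AE\u09BF"));

        // Painted scythe, KA: the typed order would be KA, VOWEL SIGN I,
        // but the scythe entry still says MA, so extraction yields MA, KA.
        String bad = extract(GID_SCYTHE, GID_KA);
        System.out.println(bad.equals("\u0995\u09BF"));
    }
}
```

Because the table holds a single entry per glyph, the swap can only be right 
for one consonant at a time, which fits the wrong extraction seen in the second 
PDF.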


> Enable PDF creation with Indian languages, by reading and utilizing the GSUB 
> table
> ----------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4189
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4189
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: FontBox, PDModel
>            Reporter: Palash Ray
>            Priority: Major
>         Attachments: Bengali-text-after.pdf, Bengali-text-before.pdf, 
> BengaliPdfGenerationHelloWorld.java, bengali-example.pdf, 
> bengali-example2.pdf, bengali-example3.pdf, bengali-word-lohit-bad.pdf, 
> bengali-word-lohit-good.pdf, committed.patch, screenshot.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Implemented proper rendering of Indian languages, which need extensive Glyph 
> substitution. The GSUB table has been read and used effectively to replace 
> some compound words with their respective Glyphs. All tests are passing. I 
> have tested this for the Bengali font. Please review these changes and let me 
> know if it makes sense to incorporate these.


