[jira] [Comment Edited] (FOP-1969) Surrogate pairs not treated as single unicode codepoint for display purposes

Simone Rondelli (JIRA) Tue, 20 Sep 2016 07:55:55 -0700

    [ 
https://issues.apache.org/jira/browse/FOP-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15506786#comment-15506786
 ]


Simone Rondelli edited comment on FOP-1969 at 9/20/16 2:53 PM:
---------------------------------------------------------------

I see the problem.

{code:java|title=MultiByteFont.java}
private CharSequence mapGlyphsToChars(GlyphSequence gs) {
    int ng = gs.getGlyphCount();
    CharBuffer cb = CharBuffer.allocate(gs.getUTF16CharacterCount());  // <-- Here
    int ccMissing = Typeface.NOT_FOUND;
    for (int i = 0, n = ng; i < n; i++) {
        int gi = gs.getGlyph(i);
        int cc = findCharacterFromGlyphIndex(gi); // <-- Problem
        if ((cc == 0) || (cc > 0x10FFFF)) {
            cc = ccMissing;
            log.warn("Unable to map glyph index " + gi
                     + " to Unicode scalar in font '"
                     + getFullName() + "', substituting missing character '"
                     + (char) cc + "'");
        }
        if (cc > 0x00FFFF) {
            int sh;
            int sl;
            cc -= 0x10000;
            sh = ((cc >> 10) & 0x3FF) + 0xD800;
            sl = ((cc >>  0) & 0x3FF) + 0xDC00;
            cb.put((char) sh);
            cb.put((char) sl);
        } else {
            cb.put((char) cc);
        }
    }
    cb.flip();
    return cb;
}
{code}
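
The manual surrogate split in the {{cc > 0x00FFFF}} branch is the standard UTF-16 encoding of a supplementary code point. As a minimal check, assuming nothing beyond the JDK, the sketch below compares that arithmetic with {{Character.toChars}}. The class and method names are hypothetical and are not part of FOP.

```java
import java.util.Arrays;

class SurrogateSplitCheck {

    /** Same arithmetic as the cc > 0x00FFFF branch in mapGlyphsToChars. */
    static char[] manualSplit(int cc) {
        int v = cc - 0x10000;
        char sh = (char) (((v >> 10) & 0x3FF) + 0xD800); // high (leading) surrogate
        char sl = (char) ((v & 0x3FF) + 0xDC00);         // low (trailing) surrogate
        return new char[] {sh, sl};
    }

    /** True if the manual split matches the JDK's own UTF-16 encoding. */
    static boolean agreesWithJdk(int cc) {
        return Arrays.equals(manualSplit(cc), Character.toChars(cc));
    }

    public static void main(String[] args) {
        // Walk every supplementary code point, U+10000 through U+10FFFF.
        for (int cc = 0x10000; cc <= 0x10FFFF; cc++) {
            if (!agreesWithJdk(cc)) {
                throw new IllegalStateException("Mismatch at U+" + Integer.toHexString(cc));
            }
        }
        System.out.println("manual split agrees with Character.toChars");
    }
}
```

{{Character.charCount(cc)}} returns 1 or 2 for the same input, so it could be used to count the UTF-16 chars that a mapped code point needs.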

In Urdu, a single character can be mapped to multiple glyphs. This sequence is 
enough to make the program crash: اآخری. Before my modification the CharBuffer 
was initialized this way: {{CharBuffer.allocate(gs.getGlyphCount());}}. That 
caused a BufferOverflowException with surrogate pairs, because one glyph then 
corresponds to multiple UTF-16 chars. This is why I changed it to 
{{CharBuffer.allocate(gs.getUTF16CharacterCount());}}, which in turn does not 
work in this case, where a single character is mapped to multiple glyphs.
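
Both failure modes come down to a fixed-capacity CharBuffer receiving more chars than it was sized for. A minimal sketch, assuming only the JDK: the class name is hypothetical, and the code points in the second case are illustrative stand-ins, not the actual glyph output for the Urdu sequence above.

```java
import java.nio.BufferOverflowException;
import java.nio.CharBuffer;

class CharBufferOverflowDemo {

    /**
     * Writes the given code points as UTF-16 into a buffer of fixed capacity
     * and reports whether the buffer overflowed.
     */
    static boolean overflows(int capacity, int[] codePoints) {
        CharBuffer cb = CharBuffer.allocate(capacity);
        try {
            for (int cp : codePoints) {
                // One char for a BMP code point, two (a surrogate pair) otherwise.
                cb.put(Character.toChars(cp));
            }
            return false;
        } catch (BufferOverflowException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Sized by glyph count: one glyph that maps back to U+1F600 needs 2 chars.
        System.out.println(overflows(1, new int[] {0x1F600}));         // true
        // Sized by input char count: one input char, two glyphs, each mapped to a BMP char.
        System.out.println(overflows(1, new int[] {0x0627, 0x0622}));  // true
        // Sized to the number of chars actually written: no overflow.
        System.out.println(overflows(2, new int[] {0x0627, 0x0622}));  // false
    }
}
```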

Now the question is: what is the correct way to count the characters in the 
GlyphSequence?

# I could use the GlyphSequence.association list and the content of 
GlyphSequence.characters to count the real number of characters that 
correspond to the given glyph sequence. The problem I can see is that 
{{findCharacterFromGlyphIndex(gi);}} might return different chars (with 
different sizes) from the ones in GlyphSequence.characters.
# Resize the CharBuffer when it gets full.
# Put the chars into a List and then into a CharBuffer.
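
Options 2 and 3 both amount to using a growable buffer, so the output size does not need to be known up front. A minimal sketch of that idea, assuming a StringBuilder is acceptable as the returned CharSequence: {{glyphToCodePoint}} is a hypothetical stand-in for {{findCharacterFromGlyphIndex}}, {{MISSING}} is an assumed placeholder for {{Typeface.NOT_FOUND}}, and the warning log is omitted. This is not the FOP implementation.

```java
import java.util.function.IntUnaryOperator;

class GlyphToCharsSketch {

    // Assumed placeholder for Typeface.NOT_FOUND; the real value may differ.
    static final int MISSING = '#';

    /**
     * Maps each glyph index to a code point and appends it as UTF-16.
     * The StringBuilder grows as needed, so neither the glyph count nor the
     * input char count has to match the number of chars produced.
     */
    static CharSequence mapGlyphsToChars(int[] glyphs, IntUnaryOperator glyphToCodePoint) {
        StringBuilder sb = new StringBuilder(glyphs.length);
        for (int gi : glyphs) {
            int cc = glyphToCodePoint.applyAsInt(gi);
            if (cc == 0 || !Character.isValidCodePoint(cc)) {
                cc = MISSING;
            }
            // Appends one char for BMP code points, a surrogate pair otherwise.
            sb.appendCodePoint(cc);
        }
        return sb;
    }

    public static void main(String[] args) {
        // Hypothetical glyph table: 1 -> U+1F600, 2 -> U+0627, anything else unmapped.
        IntUnaryOperator table = g -> g == 1 ? 0x1F600 : g == 2 ? 0x0627 : 0;
        CharSequence cs = mapGlyphsToChars(new int[] {1, 2, 3}, table);
        System.out.println(cs.length()); // 4: surrogate pair + one BMP char + placeholder
    }
}
```

If a CharBuffer is still required by callers, {{CharBuffer.wrap(cs)}} can wrap the result afterwards.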

Any thoughts? 

PS: Why is the character retrieved using {{findCharacterFromGlyphIndex(gi);}} 
instead of using the characters inside the GlyphSequence? 




> Surrogate pairs not treated as single unicode codepoint for display purposes
> ----------------------------------------------------------------------------
>
>                 Key: FOP-1969
>                 URL: https://issues.apache.org/jira/browse/FOP-1969
>             Project: FOP
>          Issue Type: Improvement
>          Components: unqualified
>    Affects Versions: trunk
>         Environment: Operating System: All
> Platform: All
>            Reporter: Glenn Adams
>         Attachments: Urdu.zip, pcltest.zip, single-byte.zip, testing.fo, 
> testing.fo, testing.pdf, testing.pdf, testing.xml, testing.xsl, tiffttc.zip
>
>
> Unicode codepoints outside of the BMP (Basic Multilingual Plane), i.e., whose 
> scalar value is greater than 0xFFFF (65535), are coded as UTF-16 surrogate 
> pairs in Java strings, which pair should be treated as a single codepoint for 
> the purpose of mapping to a glyph in a font (that supports extra-BMP 
> mappings);
> at present, FOP does not correctly handle this case in simple (non complex 
> script) rendering paths;
> furthermore, though some support has been added to handle this in the complex 
> script rendering path, it has not yet been tested, so is not necessarily 
> working there either;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
