Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Xmlgraphics-fop Wiki" 
for change notification.

The "SurrogatePairs" page has been changed by SimoneRondelli:
https://wiki.apache.org/xmlgraphics-fop/SurrogatePairs

New page:
<<TableOfContents(2)>>

== Overview ==
ApacheFOP treats a surrogate pair as 2 separate code points. Most of the 
methods accept/return a single char (e.g. Typeface.mapChar(char c)), which means 
they can deal only with BMP characters (<= 0xFFFF).

In order to correctly handle non-BMP characters (e.g. emoji, mathematical 
symbols, ancient scripts, CJK extensions), ApacheFOP should deal with int rather 
than char. It is possible to represent the whole Unicode range using a single 
int, while it is not possible with a single UTF-16 char.
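
As a small illustration of the point above (not FOP code; the class and method 
names are made up for this example), the following sketch shows that a non-BMP 
code point fits in one int but needs two UTF-16 chars:

```java
final class CodePointVsChar {
    // U+1F600 (an emoji) lies outside the BMP.
    static final int GRINNING_FACE = 0x1F600;

    /** Returns the UTF-16 form of a code point: one char for BMP, two otherwise. */
    static String toUtf16(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = toUtf16(GRINNING_FACE);
        System.out.println(s.length());                      // 2 (UTF-16 chars)
        System.out.println(s.codePointCount(0, s.length())); // 1 (code point)
        // The two chars are the high and low surrogate: D83D DE00
        System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1));
    }
}
```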

These are the main aspects of this modification:

 1. Read the non-BMP glyphs from the font
 1. Make the API use int instead of char
 1. Convert surrogate pairs to a single int
 1. Adapt the renderer

== Read the non-BMP glyphs from the font ==

The glyph information is stored in one of the font's cmap tables. The 
implemented one is:

 * PlatformID: 3 (Microsoft)
 * EncodingID: 10 (Unicode UCS-4)
 * CMAP format: 12 (Formats 8, 10, 12, 13, and 14 are used for mixed 
16/32-bit and pure 32-bit mappings. This supports text encoded with surrogates 
in Unicode 2.0 and later)

Apple TrueType Reference Manual: 
(https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html)
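
To make the format 12 layout more concrete, here is a minimal sketch of a 
lookup over its sequential map groups. Each group carries startCharCode, 
endCharCode and startGlyphID, as described in the cmap specification. The class 
names and sample values are assumptions made for the example; this is not the 
FOP reader code.

```java
final class Cmap12Lookup {
    /** One sequential map group of a format 12 subtable. */
    static final class Group {
        final int startCharCode;
        final int endCharCode;
        final int startGlyphId;

        Group(int startCharCode, int endCharCode, int startGlyphId) {
            this.startCharCode = startCharCode;
            this.endCharCode = endCharCode;
            this.startGlyphId = startGlyphId;
        }
    }

    /**
     * Returns the glyph index for a code point, or 0 (.notdef) if no group
     * covers it. Groups must be sorted by startCharCode and not overlap.
     */
    static int glyphIndex(Group[] groups, int codePoint) {
        int lo = 0;
        int hi = groups.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            Group g = groups[mid];
            if (codePoint < g.startCharCode) {
                hi = mid - 1;
            } else if (codePoint > g.endCharCode) {
                lo = mid + 1;
            } else {
                // Glyph ids are consecutive within a group.
                return g.startGlyphId + (codePoint - g.startCharCode);
            }
        }
        return 0;
    }
}
```

Since the character codes are 32 bits wide, non-BMP code points need no special 
treatment in the lookup itself.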

== Make the API use int instead of char ==

Such a modification would mainly affect the Font class hierarchy. The Typeface 
class is one of the base classes of that hierarchy and is one of the classes 
that should be modified. However, it has (as of September 2016) approximately 
27 subclasses/implementations, which would make the scope of the modification 
very large.

Since not all font classes are expected to deal with non-BMP code points, it 
is possible to narrow the scope of the modification to a smaller number of 
classes. This is meant as a first step that provides at least one working path 
for handling surrogate pairs.

The class identified as a good starting point is CIDFont, since CID fonts were 
designed to handle huge character sets. From the Adobe documentation: "CID fonts 
are a new format of composite (multibyte) Type 1 fonts that better address the 
requirements of Far East markets. Adobe developed the CID-keyed font file 
format to support large character set fonts..." (src: 
http://www.adobe.com/products/postscript/pdfs/cid.pdf).


{{{
FontMetrics
    Typeface
        SystemFontMetricsMapper
        LazyFont
        AFPFont
            RasterFont
            AbstractOutlineFont
                DoubleByteFont
                OutlineFont
                    AFPTrueTypeFont in AFPFontConfig
        Base14Font
            Helvetica
            Symbol
            HelveticaBoldOblique
            HelveticaOblique
            HelveticaBold
            ZapfDingbats
            Courier
            CourierBold
            TimesBold
            TimesBoldItalic
            TimesItalic
            TimesRoman
            CourierOblique
            CourierBoldOblique
        CustomFontMetricsMapper
        CustomFont
            SingleByteFont                + hasCodePoint(int):boolean
            CIDFont   <------------------ + mapCodePoint(int):int
                MultiByteFont

                                         ~ getUnicode(int):char -> getUnicode(int):int
CIDSet (Used by CIDFont) <-------------- + mapCodePoint(int, int):int
    CIDSubset
    CIDFull
                                          + hasCodePoint(int):boolean
Font <----------------------------------- + mapCodePoint(int):int
}}}

getUnicode(): is defined in CIDSet (it is not a member of the Typeface class or 
of one of its subclasses). I changed the signature of this method to handle int 
instead of char because it is semantically incorrect to represent a Unicode 
code point with a single UTF-16 char. As you can see from the CIDSet hierarchy, 
the change affects only 3 classes.

getUnicodeFromGID(): this method is defined in CustomFont and CIDSet. It never 
gets called from the MultiByteFont path, probably because getUnicode is used 
instead. That is why I'm down casting the return value from int to char in 
CIDFull and CIDSubset. Probably the best thing to do would be to get rid of 
this method or make it handle int, but again the change would affect more 
classes than the ones in scope.
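
The risk of such a down cast can be shown with a short sketch. The class and 
method names below are hypothetical and only illustrate the Java narrowing 
conversion, not the actual CIDFull/CIDSubset code:

```java
final class NarrowingCast {
    /** Narrowing int -> char keeps only the low 16 bits. */
    static char narrow(int codePoint) {
        return (char) codePoint;
    }

    /** True if the code point can be stored in a single char without loss. */
    static boolean fitsInChar(int codePoint) {
        return Character.isBmpCodePoint(codePoint);
    }

    public static void main(String[] args) {
        // U+1F600 silently becomes U+F600, an unrelated private-use character.
        System.out.printf("%04X%n", (int) narrow(0x1F600)); // F600
        System.out.println(fitsInChar(0x1F600));            // false
    }
}
```

The cast is harmless only as long as the method is never reached with a 
non-BMP value, which is the assumption made above for the MultiByteFont path.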

== Convert surrogate pairs to a single int ==

The data arrives as a String, in which non-BMP characters are represented as 
surrogate pairs. Every time an operation is performed on the data (e.g. 
Font.mapCodePoint(int)), surrogate pairs should be converted to the 
corresponding code point.
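
For reference, the arithmetic behind this conversion is sketched below. It 
gives the same result as the JDK method Character.toCodePoint for well-formed 
pairs; the class name and the choice to throw on ill-formed input are 
assumptions made for this example.

```java
final class SurrogateMath {
    /**
     * Combines a high and a low surrogate into one code point.
     * Throws if the two chars do not form a well-formed pair.
     */
    static int decode(char high, char low) {
        if (!Character.isHighSurrogate(high) || !Character.isLowSurrogate(low)) {
            throw new IllegalArgumentException("ill-formed surrogate pair");
        }
        // 10 bits come from each surrogate; the result starts at U+10000.
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        System.out.printf("%X%n", decode('\uD83D', '\uDE00')); // 1F600
    }
}
```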

The current implementation makes this conversion inside the for loops that 
process the data:

{{{#!java
for (int i = 0; i < text.length(); i++) {
    int cp = text.charAt(i);

    // Throws an exception if the surrogate pair is ill-formed
    if (CharUtilities.containsSurrogatePairAt(text, i)) {
        cp = Character.toCodePoint((char) cp, text.charAt(++i));
    }
    // [...]
}
}}}

or

{{{#!java
for (int i = 0; i < text.length(); i++) {
    // Java API; does NOT throw an error if the surrogate pair is ill-formed
    int cp = text.codePointAt(i);

    // Skip the low surrogate when cp is a non-BMP code point
    i += CharUtilities.incrementIfNonBMP(cp);
    // [...]
}
}}}

The best thing to do in the future is to implement something like the Java 8 
API String.codePoints(), which allows you to iterate directly over a 
stream/array of code points, avoiding boilerplate code.
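
A minimal sketch of that style of iteration, using the Java 8 method 
String.codePoints(), could look as follows. The class and method names are made 
up for this example. Note that codePoints() does not throw on unpaired 
surrogates; it passes them through unchanged.

```java
final class CodePointIteration {
    /** Returns the code points of the text, surrogate pairs already combined. */
    static int[] codePointsOf(String text) {
        return text.codePoints().toArray();
    }

    public static void main(String[] args) {
        // "A", U+1F600 (written as a surrogate pair), "B"
        String text = "A\uD83D\uDE00B";
        for (int cp : codePointsOf(text)) {
            // No manual index bookkeeping: each cp is a full code point.
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```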

== Adapt the Renderer ==

Every ApacheFOP output format has its own way to represent data, which means 
that each renderer needs to be adapted to handle non-BMP code points.

The adapted renderers/painters are:

 * PDFPainter
 * PSPainter
 * Java2DPainter
 * Java2DRenderer
