For others' information, this is a spin-off from the discussion at <http://bugzilla.mozilla.org/show_bug.cgi?id=182877>. See also <http://bugzilla.mozilla.org/show_bug.cgi?id=183048> and <http://bugzilla.mozilla.org/show_bug.cgi?id=183156>.
Boris, I'm sorry it took me so long to get here. In <aso898$[EMAIL PROTECTED]>, Boris Zbarsky wrote:

: I was recently told that nsAString should be assumed to hold UTF-16, not
: UCS-2 (in spite of all the evidence to the contrary in the string
: module).

This is a bit unfair :-). On the surface, UCS-2 seems to be everywhere, but under the hood it's UTF-16 everywhere. Otherwise, Mozilla wouldn't be able to support the full repertoire of Unicode/ISO 10646, which would be a rather severe limitation for a web browser. Plane 2 has been rapidly filling with Chinese characters (Ext. B and Ext. C), and plane 1 with mathematical symbols and the Gothic, Old Italic, and Deseret alphabets. Mozilla I18N people have been aware of this all along. Unfortunately, they may sometimes have been ambiguous about it (because there was no imminent need to concern themselves with non-BMP characters until recently) and may not have been very efficient in communicating it to other developers. [2]

Anyway, I hope this thread will dispel any remaining notion that we use UCS-2 as 'the' internal representation of strings in Mozilla. We use UTF-16. Here are some 'exhibits' to support that :-)

For instance, NS_ConvertUTF8toUCS2 works well with a four-byte sequence in UTF-8 and converts it to two PRUnichars (a surrogate pair). [1] The other way around, NS_ConvertUCS2toUTF8 works well with two PRUnichars (a surrogate pair), returning the corresponding four-byte UTF-8 sequence. See also nsUTF8ToUnicode.cpp and nsUnicodeToUTF8.cpp. They convert UTF-8 to and from UTF-16 (not UCS-2).

Last Saturday I found that the 'UCS4 converters' don't support surrogate pairs. This may have been overlooked because nobody actually uses UTF-32 in web pages except for test pages such as those at <http://jshin.net/i18n/utftest> or <http://www.i18nguy.com/unicode>. Anyway, I filed a bug and I've got a patch.
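For concreteness, here's a minimal sketch of what 'working well with a four-byte UTF-8 sequence' means. This is a standalone illustration, not Mozilla's actual converter code, and `Utf8ToUtf16` is a made-up name: it decodes one UTF-8 sequence and emits a surrogate pair when the code point lies above U+FFFF.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Decode one (assumed well-formed) UTF-8 sequence starting at p and append
// its UTF-16 form to out: a single code unit for a BMP character, a
// surrogate pair for a character in planes 1-16.
// Returns the number of UTF-8 bytes consumed.
std::size_t Utf8ToUtf16(const unsigned char* p, std::vector<std::uint16_t>& out)
{
    std::uint32_t cp;
    std::size_t len;
    if (p[0] < 0x80)      { cp = p[0];        len = 1; }
    else if (p[0] < 0xE0) { cp = p[0] & 0x1F; len = 2; }
    else if (p[0] < 0xF0) { cp = p[0] & 0x0F; len = 3; }
    else                  { cp = p[0] & 0x07; len = 4; }
    for (std::size_t i = 1; i < len; ++i)
        cp = (cp << 6) | (p[i] & 0x3F);      // six payload bits per trail byte

    if (cp < 0x10000) {
        out.push_back(static_cast<std::uint16_t>(cp));  // BMP: one code unit
    } else {
        cp -= 0x10000;                                  // 20 bits remain
        out.push_back(static_cast<std::uint16_t>(0xD800 | (cp >> 10)));    // high surrogate
        out.push_back(static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)));  // low surrogate
    }
    return len;
}
```

For example, U+1D11E (a plane-1 musical symbol) arrives as the four bytes F0 9D 84 9E and comes out as the two PRUnichar-sized units D834 DD1E.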
(That's bug 184120.) The choice of names is unfortunate, and it'd have been much less confusing had UTF16 been used consistently in place of UCS2. Back in June 2000, Erik wrote that it'd be better to use *UTF16*, but somehow that hasn't been acted upon. See the thread 'NS_ConvertUTF8toUCS2()' at <news:[EMAIL PROTECTED]>, in which he made it clear that 'PRUnichar*' is UTF-16.

: This means that all users of nsAString who assume that
: .Length() gives the number of characters that will be rendered on-screen
: (eg the old button reflow code) need to be fixed...

I'm not sure what exactly you meant by the number of characters. In layout, it appears that what matters is not the number of 'codepoints' but the number of graphemes, or the number and extents of the glyphs needed to render a string of graphemes (represented and stored in nsAString as a sequence of 16-bit code units). If that's the case, IMHO, relying (too much) on .Length() is inherently broken even without UTF-16 taken into account. [3] Unicode includes a lot of combining characters and supports a number of complex scripts. (The Latin, Cyrillic, and Greek alphabets are complex, too! Actually, both Mozilla and MS IE have trouble dealing with combining diacritical marks for the Latin/Greek/Cyrillic alphabets while handling the more complex Indic scripts rather well. Ironically, text browsers like w3m-m17n running under xterm are better at handling them than Mozilla, MS IE, and Opera. [4])

Therefore, counting how many unsigned shorts (16-bit code units) there are in an nsAString is of limited use. In layout, editing, and many other places, I believe what is more important is the number of graphemes and other text elements (words, lines, sentences, paragraphs) and their boundaries. See Frank's comments at <http://bugzilla.mozilla.org/show_bug.cgi?id=130441#c24> and also <http://bugzilla.mozilla.org/show_bug.cgi?id=122584>. See also the references given in [3].
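To make the .Length() point concrete, here's a sketch (an illustrative helper, not an existing Mozilla API) of counting code points rather than 16-bit code units. Note that even this is not a grapheme count, which is what layout actually needs:

```cpp
#include <cstddef>
#include <cstdint>

// Count Unicode code points in a UTF-16 buffer: a high surrogate
// (D800-DBFF) together with the low surrogate that follows it counts
// as one code point. This is still NOT a grapheme count: e.g.
// 'e' + combining acute (U+0065 U+0301) is two code points but one
// on-screen grapheme.
std::size_t CountCodePoints(const std::uint16_t* s, std::size_t codeUnits)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < codeUnits; ++i, ++n)
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < codeUnits)
            ++i;  // skip the trailing low surrogate
    return n;
}
```

For the four code units {D834, DD1E, 0065, 0301} (a plane-1 character plus 'e' with a combining acute) this returns 3 code points, while a grapheme counter would say 2, so the naive 16-bit-unit length of 4 overstates both.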
Supporting UTF-16 is not much different from supporting a sequence of a base character followed by combining characters. High surrogates act as base characters and low surrogates as combining characters. Actually, it's simpler than the generic case of base + combining characters, because a high surrogate is always followed by exactly one 'combining character' (the low surrogate), while a base character can be followed by multiple combining characters.

: Just wanted to try
: and make this information as public as possible in case people happen to
: know of other places in Mozilla that are rolling their own
: text-measurement or something in a similar vein.

As I wrote above, in text measurement it's crucial to know the distinction between the number of 16-bit code units and the actual number and extent of the glyphs needed to render a string of 16-bit code units. I can't emphasize too much that this problem is not new but has been with us even without surrogate pairs.

Jungshik

[1] Characters in the BMP (Basic Multilingual Plane) are represented by one-, two-, or three-byte sequences in UTF-8. Non-BMP characters (planes 1 through 16) take four-byte sequences in UTF-8. Plane 17 and beyond would need five- or six-byte sequences in UTF-8; however, they're not reachable by UTF-16, and the UTC (Unicode Technical Committee) and ISO/IEC JTC1/SC2/WG2 have committed themselves to never filling plane 17 and beyond. In UTF-16, a BMP character is represented by a single 16-bit integer, while a character in planes 1 through 16 is represented by a pair of 16-bit integers (the first from [D800-DBFF] and the second from [DC00-DFFF]).

[2] See a series of news postings by Erik in the thread 'Encoding wars' in the xpcom group: news://news.mozilla.org:[EMAIL PROTECTED] There you'd find why UTF-16 was chosen over the seemingly cleaner UTF-32/UCS-4. Java, ECMAScript (JavaScript), the Win32 API, and the MacOS API all use UTF-16 (!= UCS-2), so going to UTF-32/UCS-4 was ruled out.
[3] For text boundaries, graphemes, line breaking and so forth, you can refer to:
- Text boundaries (grapheme, word, sentence, paragraph, etc.): <http://www.unicode.org/unicode/reports/tr29>
- Character encoding model (character vs. glyph): <http://www.unicode.org/unicode/reports/tr17>
- Line breaking: <http://www.unicode.org/unicode/reports/tr14> (it has a serious bug in Hangul Jamo handling; the author has been notified of the issue and is preparing to revise it)
- Character Model for the WWW: <http://www.w3.org/TR/charmod/>
- Frank went to great lengths to explain this at <http://bugzilla.mozilla.org/show_bug.cgi?id=130441#c24>
- Erik's posting on this issue: <news:[EMAIL PROTECTED]>

[4] A sample page demonstrating that the Latin/Greek/Cyrillic scripts are also complex: <http://www.columbia.edu/kermit/st-erkenwald.html>
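P.S. The pairing rule described in [1] is simple arithmetic. A sketch, with hypothetical helper names, assuming a valid plane-1..16 code point and a valid pair respectively:

```cpp
#include <cstdint>

// Split a plane-1..16 code point (U+10000..U+10FFFF) into its UTF-16
// surrogate pair: subtract 0x10000, then the top 10 of the remaining
// 20 bits select the high surrogate and the bottom 10 the low one.
void CodePointToPair(std::uint32_t cp, std::uint16_t& high, std::uint16_t& low)
{
    cp -= 0x10000;
    high = static_cast<std::uint16_t>(0xD800 + (cp >> 10));    // [D800-DBFF]
    low  = static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF));  // [DC00-DFFF]
}

// The inverse: recombine a surrogate pair into the code point it represents.
std::uint32_t PairToCodePoint(std::uint16_t high, std::uint16_t low)
{
    return 0x10000
         + ((static_cast<std::uint32_t>(high) - 0xD800) << 10)
         + (low - 0xDC00);
}
```

For instance, U+20021 (a CJK Ext. B ideograph in plane 2) splits into the pair D840/DC21 and recombines exactly.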
