For others' information, this is a spin-off from the discussion at <http://bugzilla.mozilla.org/show_bug.cgi?id=182877>. See also <http://bugzilla.mozilla.org/show_bug.cgi?id=183048> and <http://bugzilla.mozilla.org/show_bug.cgi?id=183156>.
Boris, I'm sorry it took me so long to get here. In <aso898$[EMAIL PROTECTED]>, Boris Zbarsky wrote:

: I was recently told that nsAString should be assumed to hold UTF-16, not
: UCS-2 (in spite of all the evidence to the contrary in the string
: module).

This is a bit unfair :-). On the surface, UCS-2 seems to be everywhere, but under the hood it's UTF-16 everywhere. Otherwise, Mozilla wouldn't be able to support the full repertoire of Unicode/ISO 10646, which would be a rather severe limitation for a web browser. Plane 2 has been rapidly filling with Chinese characters (Ext. B and Ext. C), and plane 1 with mathematical symbols and the Gothic, Old Italic, and Deseret alphabets. Mozilla I18N people have been aware of this all along. Unfortunately, they may sometimes have been ambiguous about it (because there was no imminent need to concern themselves with non-BMP characters until recently) and may not have been very efficient in communicating it to other developers. [2]

Anyway, I hope this thread will dispel any remaining notion that we use UCS-2 as 'the' internal representation of strings in Mozilla. We use UTF-16. Here are some 'exhibits' to support that :-)

For instance, NS_ConvertUTF8toUCS2 works well with a four-byte sequence in UTF-8 and converts it to two PRUnichars (a surrogate pair). [1] The other way around, NS_ConvertUCS2toUTF8 works well with two PRUnichars (a surrogate pair), returning the corresponding four-byte UTF-8 sequence. See also nsUTF8ToUnicode.cpp and nsUnicodeToUTF8.cpp. They convert UTF-8 to and from UTF-16 (not UCS-2).

Last Saturday I found that the 'UCS4 converters' don't support surrogate pairs. This may have been overlooked because nobody actually uses UTF-32 in web pages except for test pages such as those at <http://jshin.net/i18n/utftest> or <http://www.i18nguy.com/unicode>. Anyway, I filed a bug and I've got a patch.
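For concreteness, here's a minimal sketch of what 'working well with a four-byte UTF-8 sequence' means. This is a standalone illustration, not Mozilla's actual converter code, and `Utf8ToUtf16` is a made-up name: it decodes one UTF-8 sequence and emits a surrogate pair when the code point lies above U+FFFF.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Decode one (assumed well-formed) UTF-8 sequence starting at p and append
// its UTF-16 form to out: a single code unit for a BMP character, a
// surrogate pair for a character in planes 1-16.
// Returns the number of UTF-8 bytes consumed.
std::size_t Utf8ToUtf16(const unsigned char* p, std::vector<std::uint16_t>& out)
{
    std::uint32_t cp;
    std::size_t len;
    if (p[0] < 0x80)      { cp = p[0];        len = 1; }
    else if (p[0] < 0xE0) { cp = p[0] & 0x1F; len = 2; }
    else if (p[0] < 0xF0) { cp = p[0] & 0x0F; len = 3; }
    else                  { cp = p[0] & 0x07; len = 4; }
    for (std::size_t i = 1; i < len; ++i)
        cp = (cp << 6) | (p[i] & 0x3F);      // six payload bits per trail byte

    if (cp < 0x10000) {
        out.push_back(static_cast<std::uint16_t>(cp));  // BMP: one code unit
    } else {
        cp -= 0x10000;                                  // 20 bits remain
        out.push_back(static_cast<std::uint16_t>(0xD800 | (cp >> 10)));    // high surrogate
        out.push_back(static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)));  // low surrogate
    }
    return len;
}
```

For example, U+1D11E (a plane-1 musical symbol) arrives as the four bytes F0 9D 84 9E and comes out as the two PRUnichar-sized units D834 DD1E.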
(That's bug 184120.) The choice of names is unfortunate, and it'd have been much less confusing had UTF16 been used consistently in place of UCS2. Back in June 2000, Erik wrote that it'd be better to use *UTF16*, but somehow that hasn't been acted upon. See the thread 'NS_ConvertUTF8toUCS2()' at <news:[EMAIL PROTECTED]>, in which he made it clear that 'PRUnichar*' is UTF-16.

: This means that all users of nsAString who assume that
: .Length() gives the number of characters that will be rendered on-screen
: (eg the old button reflow code) need to be fixed...

I'm not sure what exactly you meant by the number of characters. In layout, it appears that what matters is not the number of 'codepoints' but the number of graphemes, or the number and extents of the glyphs needed to render a string of graphemes (represented and stored in nsAString as a sequence of 16-bit code units). If that's the case, IMHO, relying (too much) on .Length() is inherently broken even without UTF-16 taken into account. [3] Unicode includes a lot of combining characters and supports a number of complex scripts. (The Latin, Cyrillic, and Greek alphabets are complex, too! Actually, both Mozilla and MS IE have trouble dealing with combining diacritical marks for the Latin/Greek/Cyrillic alphabets while handling the more complex Indic scripts rather well. Ironically, text browsers like w3m-m17n running under xterm are better at handling them than Mozilla, MS IE, and Opera. [4])

Therefore, counting how many unsigned shorts (16-bit code units) there are in an nsAString is of limited use. In layout, editing, and many other places, I believe what is more important is the number of graphemes and other text elements (words, lines, sentences, paragraphs) and their boundaries. See Frank's comments at <http://bugzilla.mozilla.org/show_bug.cgi?id=130441#c24> and also <http://bugzilla.mozilla.org/show_bug.cgi?id=122584>. See also the references given in [3].
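To make the .Length() point concrete, here's a sketch (an illustrative helper, not an existing Mozilla API) of counting code points rather than 16-bit code units. Note that even this is not a grapheme count, which is what layout actually needs:

```cpp
#include <cstddef>
#include <cstdint>

// Count Unicode code points in a UTF-16 buffer: a high surrogate
// (D800-DBFF) together with the low surrogate that follows it counts
// as one code point. This is still NOT a grapheme count: e.g.
// 'e' + combining acute (U+0065 U+0301) is two code points but one
// on-screen grapheme.
std::size_t CountCodePoints(const std::uint16_t* s, std::size_t codeUnits)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < codeUnits; ++i, ++n)
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < codeUnits)
            ++i;  // skip the trailing low surrogate
    return n;
}
```

For the four code units {D834, DD1E, 0065, 0301} (a plane-1 character plus 'e' with a combining acute) this returns 3 code points, while a grapheme counter would say 2, so the naive 16-bit-unit length of 4 overstates both.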
Supporting UTF-16 is not much different from supporting a sequence of a base character followed by combining characters. High surrogates act as base characters and low surrogates as combining characters. Actually, it's simpler than the generic case of base + combining characters, because a high surrogate is always followed by exactly one 'combining character' (the low surrogate), while a base character can be followed by multiple combining characters.

: Just wanted to try
: and make this information as public as possible in case people happen to
: know of other places in Mozilla that are rolling their own
: text-measurement or something in a similar vein.

As I wrote above, in text measurement it's crucial to know the distinction between the number of 16-bit code units and the actual number and extent of the glyphs needed to render a string of 16-bit code units. I can't emphasize too much that this problem is not new but has been with us even without surrogate pairs.

Jungshik

[1] Characters in the BMP (Basic Multilingual Plane) are represented by one-, two-, or three-byte sequences in UTF-8. Non-BMP characters (planes 1 through 16) take four-byte sequences in UTF-8. Plane 17 and beyond would need five- or six-byte sequences in UTF-8; however, they're not reachable by UTF-16, and the UTC (Unicode Technical Committee) and ISO/IEC JTC1/SC2/WG2 have committed themselves to never filling plane 17 and beyond. In UTF-16, a BMP character is represented by a single 16-bit integer, while a character in planes 1 through 16 is represented by a pair of 16-bit integers (the first from [D800-DBFF] and the second from [DC00-DFFF]).

[2] See a series of news postings by Erik in the thread 'Encoding wars' in the xpcom group: news://news.mozilla.org:[EMAIL PROTECTED] There you'd find why UTF-16 was chosen over the seemingly cleaner UTF-32/UCS-4. Java, ECMAScript (JavaScript), the Win32 API, and the MacOS API all use UTF-16 (!= UCS-2), so going to UTF-32/UCS-4 was ruled out.
[3] For text boundaries, graphemes, line breaking and so forth, you can refer to:
- Text boundaries (grapheme, word, sentence, paragraph, etc.): <http://www.unicode.org/unicode/reports/tr29>
- Character encoding model (character vs. glyph): <http://www.unicode.org/unicode/reports/tr17>
- Line breaking: <http://www.unicode.org/unicode/reports/tr14> (it has a serious bug in Hangul Jamo handling; the author has been notified of the issue and is preparing to revise it)
- Character Model for the WWW: <http://www.w3.org/TR/charmod/>
- Frank went to great lengths to explain this at <http://bugzilla.mozilla.org/show_bug.cgi?id=130441#c24>
- Erik's posting on this issue: <news:[EMAIL PROTECTED]>

[4] A sample page demonstrating that the Latin/Greek/Cyrillic scripts are also complex: <http://www.columbia.edu/kermit/st-erkenwald.html>
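P.S. The pairing rule described in [1] is simple arithmetic. A sketch, with hypothetical helper names, assuming a valid plane-1..16 code point and a valid pair respectively:

```cpp
#include <cstdint>

// Split a plane-1..16 code point (U+10000..U+10FFFF) into its UTF-16
// surrogate pair: subtract 0x10000, then the top 10 of the remaining
// 20 bits select the high surrogate and the bottom 10 the low one.
void CodePointToPair(std::uint32_t cp, std::uint16_t& high, std::uint16_t& low)
{
    cp -= 0x10000;
    high = static_cast<std::uint16_t>(0xD800 + (cp >> 10));    // [D800-DBFF]
    low  = static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF));  // [DC00-DFFF]
}

// The inverse: recombine a surrogate pair into the code point it represents.
std::uint32_t PairToCodePoint(std::uint16_t high, std::uint16_t low)
{
    return 0x10000
         + ((static_cast<std::uint32_t>(high) - 0xD800) << 10)
         + (low - 0xDC00);
}
```

For instance, U+20021 (a CJK Ext. B ideograph in plane 2) splits into the pair D840/DC21 and recombines exactly.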
