[Fortunately though, UTF-16 remains of little bother to anyone in the Unix/Plan9 world, where UTF-16 and its 0x10ffff limit are virtually unheard of, except for the occasional shaking of heads, and very likely will remain so.]
"... in the Unix/Plan9 world, where UTF-16 ... virtually unheard of..."? This seems a bit too strong a statement :-)
There are certainly significant amounts of non-trivial software running on Unix/Linux and using UTF-16. There is Java. There is a lot of software that deals with HTML and XML: Mozilla/Netscape and Opera use UTF-16, and the XML DOM API is defined in terms of it. KDE uses UTF-16. I think SAP does, as well as probably many other vendors writing portable, Unicode-enabled software. (Do I dare mention little ICU among these large applications and frameworks?)
[Everything Microsoft uses UTF-16, of course, but that is not "in the Unix/Plan9 world".]
[... UTF-16 remains an ugly miscarriage, because by placing the surrogates not at the end of the 16-bit space but into the middle of the code range, ...]
I think everyone agrees that it would have been better and more elegant to have the surrogates at the end of the 16-bit range, and to have 0xFFFFF (for example) as the highest code point value. (In practice, it may be odd but does not present any particular difficulty that the highest code point is 0x10FFFF.)
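For reference, the odd-looking number falls straight out of the surrogate arithmetic:

    1024 lead surrogates * 1024 trail surrogates = 0x100000 supplementary
    code points, starting at U+10000, so the highest possible code point
    is 0x10000 + 0x100000 - 1 = 0x10FFFF.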
I also believe that most everyone agrees that if Unicode had had available in 1988 or even 1993 the current level of sophistication in fonts and layout engines and the experience with character encoding (including IDS and variation selectors), then it could have stayed with a fixed-width 16-bit form.
Unicode is simply the result of many compromises, and not all of them are pretty.
UTF-16 may be "ugly" to some, but it works. (Before someone jumps in here: I am not saying UTF-8 doesn't! All of UTF-8/16/32 "work".)
For processing, it is easier to deal with one-or-two units per code point than one-or-two-or-three-or-four of them, and single-unit performance optimizations are very useful for UTF-16.
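To make the "one-or-two units" point concrete, here is a minimal sketch in C (the function name and the buffer-plus-length convention are just choices for the example) of the kind of loop this allows: the common single-unit BMP case is the fast path, and a surrogate pair only costs one extra unit.

    #include <stddef.h>
    #include <stdint.h>

    /* Count the code points in a buffer of UTF-16 code units.
     * Most text is BMP-only, so the single-unit branch dominates;
     * a lead surrogate (D800..DBFF) followed by a trail surrogate
     * (DC00..DFFF) simply consumes one extra unit. */
    size_t u16_count_codepoints(const uint16_t *s, size_t len)
    {
        size_t i = 0, n = 0;
        while (i < len) {
            uint16_t c = s[i++];
            if (c >= 0xD800 && c <= 0xDBFF &&           /* lead surrogate */
                i < len &&
                s[i] >= 0xDC00 && s[i] <= 0xDFFF) {     /* trail follows  */
                ++i;                                    /* two-unit code point */
            }
            ++n;                                        /* one code point either way */
        }
        return n;
    }

The corresponding UTF-8 loop has to distinguish one-, two-, three- and four-byte sequences, which is the "one-or-two-or-three-or-four" case above.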
[... it leads to an incompatible binary sorting order in B-trees with UCS-4 and UTF-8 and therefore is useless for database applications that want to hide the internal encoding from the user of B-tree iterators.]
True but overstated. It is quite easy to write an efficient function that compares UTF-16 strings in code point order.
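To back that up, here is a sketch of such a function in C (the names are made up for the example; a real implementation would also want explicit lengths and more careful handling of unpaired surrogates). The simple version just decodes both strings code point by code point, which by construction gives the same order as a binary comparison of the UCS-4 or UTF-8 forms:

    #include <stdint.h>

    /* Return the next code point from a NUL-terminated UTF-16 string and
     * advance the pointer.  An unpaired surrogate is returned as-is. */
    static uint32_t u16_next(const uint16_t **p)
    {
        uint32_t c = *(*p)++;
        if (c >= 0xD800 && c <= 0xDBFF) {               /* lead surrogate   */
            uint32_t t = **p;
            if (t >= 0xDC00 && t <= 0xDFFF) {           /* valid trail unit */
                ++*p;
                c = 0x10000 + ((c - 0xD800) << 10) + (t - 0xDC00);
            }
        }
        return c;
    }

    /* Compare two UTF-16 strings in code point order, i.e. in the order
     * a binary comparison of their UCS-4 or UTF-8 forms would produce. */
    int u16_cmp_codepoint_order(const uint16_t *s1, const uint16_t *s2)
    {
        for (;;) {
            uint32_t c1 = u16_next(&s1);
            uint32_t c2 = u16_next(&s2);
            if (c1 != c2)
                return c1 < c2 ? -1 : 1;
            if (c1 == 0)                                /* both ended together */
                return 0;
        }
    }

The efficient version (this is how ICU does it, for instance) compares code units directly and only applies a small fix-up at the first difference when both units are at or above 0xD800, because that is the only case in which code unit order and code point order disagree; the fast path is then the same as a plain 16-bit binary comparison.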
Now, who invented UTF-16?
Not me ;-)
Markus
-- Opinions expressed here may not reflect my company's positions unless otherwise noted.
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
