I also believe that almost everyone agrees that if Unicode had had, in 1988 or even 1993, the current level of sophistication in fonts and layout engines, and today's experience with character encoding (including IDS and variation selectors), then it could have stayed with a fixed-width 16-bit form.


Combining characters and context-sensitive characters diminish the value of a fixed width per code point
(for example, you cannot assume that it is safe to break a string at an arbitrary code point boundary).
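
A minimal sketch of that point, assuming a decomposed "e" + combining acute accent: even with fixed-width UTF-32 storage, cutting between two array elements can separate a base character from its combining mark, so "one code point == one unit" does not make arbitrary cut points safe.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* "és" spelled as U+0065 LATIN SMALL LETTER E,
         * U+0301 COMBINING ACUTE ACCENT, U+0073 LATIN SMALL LETTER S. */
        uint32_t text[] = { 0x0065, 0x0301, 0x0073 };
        size_t cut = 1;   /* a perfectly "valid" code point boundary */

        /* The left half ends with a bare 'e'; the right half starts with
         * a combining accent that now has nothing to attach to. */
        printf("left piece ends at U+%04X, right piece starts with U+%04X\n",
               (unsigned)text[cut - 1], (unsigned)text[cut]);
        return 0;
    }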


UTF-16 may be "ugly" to some, but it works. (Before someone jumps in here: I am not saying UTF-8 doesn't! All of UTF-8/16/32 "work".)
For processing, it is easier to deal with one-or-two units per code point than one-or-two-or-three-or-four of them, and single-unit performance optimizations are very useful for UTF-16.
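
To make that "one or two units" concrete, here is a minimal sketch of a UTF-16 decode step in C; the name utf16_next and its (deliberately lax) error handling are my own illustration, not any particular library's API.

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Decode one code point starting at s[i]; returns how many 16-bit
     * units were consumed (1 or 2).  Unpaired surrogates fall through
     * as-is here for brevity; real code would signal an error or
     * substitute U+FFFD. */
    static size_t utf16_next(const uint16_t *s, size_t len, size_t i,
                             uint32_t *cp)
    {
        uint16_t hi = s[i];
        if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < len) {
            uint16_t lo = s[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                *cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                                 | (lo - 0xDC00));
                return 2;
            }
        }
        *cp = hi;
        return 1;
    }

    int main(void)
    {
        /* "A" (U+0041), then U+10400 as the surrogate pair D801 DC00. */
        uint16_t s[] = { 0x0041, 0xD801, 0xDC00 };
        size_t len = sizeof s / sizeof s[0];

        for (size_t i = 0; i < len; ) {
            uint32_t cp;
            size_t n = utf16_next(s, len, i, &cp);
            printf("U+%04X (%zu unit%s)\n", (unsigned)cp, n,
                   n == 1 ? "" : "s");
            i += n;
        }
        return 0;
    }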


Hardly: UTF-16 combines the worst aspects of UTF-32 and UTF-8 into one congealed cluster.
It cannot recover from a single lost byte, it is sensitive to machine byte order, it cannot be sorted
naively as a binary object, it cannot be embedded into source code as literals (at best you get
ugly escape sequences), it cannot be used for web pages or any existing common wire protocols,
it cannot be sanely recommended for any future wire protocols, it is the only one of the three
with no room for expansion, and it actually impinges on a good chunk of the BMP that would
otherwise be useful.
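
To illustrate the "cannot be sorted naively" point, here is a small sketch (my own example, not tied to any library): comparing raw 16-bit units orders a BMP character after a supplementary one, because surrogates occupy 0xD800-0xDFFF, below a large part of the BMP. Naive byte-wise comparison of UTF-8 happens to match code point order; naive unit-wise comparison of UTF-16 does not.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint16_t a[] = { 0xFF21 };          /* U+FF21  FULLWIDTH LATIN CAPITAL A */
        uint16_t b[] = { 0xD800, 0xDC00 };  /* U+10000 as a surrogate pair */

        /* Code point order: U+FF21 < U+10000.
         * Naive unit order:  0xFF21 > 0xD800, so a[] sorts *after* b[]. */
        printf("unit compare says a %s b, but code points say a < b\n",
               a[0] > b[0] ? ">" : "<=");
        return 0;
    }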




