Markus Kuhn <Markus dot Kuhn at cl dot cam dot ac dot uk> wrote:

> UTF-16 remains an ugly miscarriage, because by placing the surrogates
> not at the end of the 16-bit space but into the middle of the code
> range, it leads to an incompatible binary sorting order in B-trees
> with UCS-4 and UTF-8 and therefore is useless for database
> applications that want to hide the internal encoding from the user of
> B-tree iterators.
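
(To make the sorting point concrete, here is a rough sketch in C. The
two characters are arbitrarily chosen; any BMP character above the
surrogate block paired with any supplementary-plane character shows
the same effect, and 16/32-bit widths for short and int are assumed.)

/* U+FF5E  FULLWIDTH TILDE           UTF-16: FF5E       UTF-8: EF BD 9E
 * U+10000 LINEAR B SYLLABLE B008 A  UTF-16: D800 DC00  UTF-8: F0 90 80 80
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* one-character "strings" in three encoding forms */
    unsigned short u16_ff5e[]  = { 0xFF5E };
    unsigned short u16_10000[] = { 0xD800, 0xDC00 };   /* surrogate pair */
    unsigned int   u32_ff5e[]  = { 0xFF5E };
    unsigned int   u32_10000[] = { 0x10000 };
    unsigned char  u8_ff5e[]   = { 0xEF, 0xBD, 0x9E };
    unsigned char  u8_10000[]  = { 0xF0, 0x90, 0x80, 0x80 };

    /* UTF-16 code-unit order: 0xD800 < 0xFF5E, supplementary char sorts low */
    printf("UTF-16: %s\n",
           u16_10000[0] < u16_ff5e[0] ? "U+10000 < U+FF5E" : "U+10000 > U+FF5E");

    /* UCS-4 order: 0x10000 > 0xFF5E, supplementary char sorts high */
    printf("UCS-4 : %s\n",
           u32_10000[0] < u32_ff5e[0] ? "U+10000 < U+FF5E" : "U+10000 > U+FF5E");

    /* UTF-8 byte order agrees with UCS-4: 0xF0 > 0xEF */
    printf("UTF-8 : %s\n",
           memcmp(u8_10000, u8_ff5e, 3) < 0 ? "U+10000 < U+FF5E" : "U+10000 > U+FF5E");

    return 0;
}

Under 16-bit code-unit comparison the surrogate pair sorts below every
BMP character from U+E000 up; under UCS-4 or UTF-8 byte comparison the
same character sorts above all of them, which is the B-tree
incompatibility being described.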

I wasn't there, but I'm sure the creators of UTF-16 would have loved to
put the surrogates at the end of the 16-bit code space, if only
characters hadn't already been assigned there that were probably in
much greater use from the outset than the Korean syllables (which were
subsequently moved, resulting in a lot of criticism of Unicode for its
"instability"). Of course, putting the surrogates at the end of the
code space would have meant the logic surrounding U+FEFF (BOM) vs.
U+FFFE (unassignable code point, used for endian checking) would have
had to be re-thought somewhat.

If UTF-16 had been designed into the architecture at the beginning,
many of these historical decisions could have been made differently.
But the original vision for Unicode, at least for some, was to encode
only the most commonly used characters (not every Han character ever
listed in a dictionary, not all 11,172 modern Hangul syllables, not
hundreds of Arabic contextual forms) and leave lesser-used characters
to the Private Use Area. Reducing the scope in this way would have
made the original vision of fitting everything into 16 bits much more
realistic.

> It appears that Miller deserves credit for recognizing that UTF-1 was
> of no use whatsoever,

That seems an overstatement. It's true that UTF-1 didn't protect the
ASCII slash and other similarly important characters from appearing in
multi-byte representations, which rendered it useless for file names.
It was also slower in implementation than UTF-8, because it used
integer division instead of bit shifting (no word on whether this ever
made a practical difference, though).

But as an encoding to be used *within* files (not in file names),
UTF-1 had some advantages over UTF-8 that we used to hear quite a bit
about on this list, oh, maybe five years ago: Latin-1 legibility and
non-use of C1 control characters.

Latin-1 characters were encoded in UTF-1 by prepending 0xA0 (NO-BREAK
SPACE), which made them fairly readable when rendered by a
Unicode-ignorant display engine. UTF-8 does something similar
(prepending Â) for Latin-1 symbols below 0xC0, but the Latin letters
starting at 0xC0 are not readable at all. This may not seem like a big
deal now, but back in the mid-'90s it was a HUGE problem for some
people.

Also, as we know, UTF-8 uses bytes in the C1 control range (0x80 to
0x9F) as continuation bytes in multi-byte sequences. People used to
complain mightily about how this broke terminal programs that
interpreted these bytes before the UTF-8 decoder had a chance to see
them, and performed control functions that might switch character sets
or even hang the terminal. Again, much software is now built to
understand UTF-8, but that wasn't the case just a few years ago. UTF-1
protected C1 bytes, but of course used printable ASCII instead (which
led to different problems).

I wouldn't recommend that we all drop everything and switch to UTF-1,
but it was not 100 percent evil.
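
For anyone who has never stared at the raw bytes, here is a small
sketch of what a Latin-1-only display or a C1-interpreting terminal
actually receives. It is only an illustration; the UTF-1 bytes follow
the prepend-0xA0 rule described above, so double-check against the
specs before quoting the values.

#include <stdio.h>

/* print a byte sequence, flagging anything in the C1 control range */
static void dump(const char *label, const unsigned char *p, int n)
{
    int i;
    printf("%-16s", label);
    for (i = 0; i < n; i++) {
        printf(" %02X", p[i]);
        if (p[i] >= 0x80 && p[i] <= 0x9F)
            printf("(C1!)");   /* a dumb terminal may act on this byte */
    }
    printf("\n");
}

int main(void)
{
    /* U+00E9 LATIN SMALL LETTER E WITH ACUTE */
    unsigned char utf1_eacute[] = { 0xA0, 0xE9 };       /* NBSP + 'é' on a Latin-1 screen */
    unsigned char utf8_eacute[] = { 0xC3, 0xA9 };       /* shows up as 'Ã©' */

    /* U+201C LEFT DOUBLE QUOTATION MARK */
    unsigned char utf8_ldquo[]  = { 0xE2, 0x80, 0x9C }; /* 0x80 and 0x9C are C1 controls */

    dump("UTF-1  U+00E9:", utf1_eacute, 2);
    dump("UTF-8  U+00E9:", utf8_eacute, 2);
    dump("UTF-8  U+201C:", utf8_ldquo, 3);
    return 0;
}

The A0 E9 pair comes out as a no-break space followed by a readable
'é', the C3 A9 pair comes out as 'Ã©', and the 0x80 and 0x9C bytes in
the quotation mark are exactly the kind a pre-UTF-8 terminal might act
on before any decoder sees them.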

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/