30 Sep 2001 22:28:52 +0900, Jens Petersen <[EMAIL PROTECTED]> wrote:
> 16 bits is enough to describe the Basic Multilingual Plane
> and I think 24 bits all the currently defined extended
> planes. So I guess the report just refers to the BMP.

In the early days the Unicode Consortium was doing everything it could
to confuse people about whether Unicode fits into 16 bits. It used to
push the view that Unicode is based on 16-bit units, and that pairs of
units from the range U+D800..U+DFFF (called surrogates) can encode about
a million extra characters (none of which had a more specific meaning
defined at that time).

I was told on the Unicode list that this was done because some people
would find it hard to accept an encoding which requires *more* than
twice as much storage as 8-bit charsets. 16 bits is "only" twice as
much.

Unfortunately some companies, like Microsoft and Oracle, believed the
"lie of Unicode marketing" and adopted the 16-bit view as their basic
internal and external format, ignoring the issue of surrogates.

Some time ago the Unicode Consortium slowly began switching to the view
that abstract characters are denoted by numbers in the range
U+0000..U+10FFFF. Storing them in 16-bit units, by expressing characters
up to U+FFFF directly and representing the others as pairs of
surrogates, is just one way to serialize Unicode to streams of bytes (or
16-bit words), called UTF-16.

AFAIK UTF-8 was first present in ISO-10646-1. The ISO standard, although
it shares the actual assignments of characters to numbers with Unicode,
viewed character codes from the beginning as 31-bit numbers, which can
be serialized for transmission using for example UTF-8 or UTF-16.

Unicode adopted UTF-8 by cutting it off at U+10FFFF. It also invented
UTF-32, which simply stores characters in 32-bit words (the endianness
issues are analogous to those of UTF-16) but is explicitly restricted to
characters up to U+10FFFF, to avoid confusion with the unrestricted
31-bit codes of ISO-10646-1. So now UTF-8, UTF-16 and UTF-32 are treated
in parallel by Unicode.
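
To make the surrogate mechanism concrete, here is a minimal sketch in
Python of how one character number in U+0000..U+10FFFF is serialized
into 16-bit units. The function names are made up for illustration; the
arithmetic is the standard UTF-16 rule, and the checks at the end only
compare it against Python's built-in codecs.

```python
def to_utf16_units(code_point):
    """Serialize one character number as a list of 16-bit units."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside U+0000..U+10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogates are the "hole": they never stand for a character.
        raise ValueError("a lone surrogate is not a character")
    if code_point <= 0xFFFF:
        return [code_point]                 # expressed directly
    offset = code_point - 0x10000           # 20-bit value
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return [high, low]


def from_surrogate_pair(high, low):
    """Inverse of the pair case above."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D11E (MUSICAL SYMBOL G CLEF) lies above U+FFFF.
units = to_utf16_units(0x1D11E)             # [0xD834, 0xDD1E]
as_bytes = b"".join(u.to_bytes(2, "big") for u in units)
assert as_bytes == chr(0x1D11E).encode("utf-16-be")

# The same character takes 4 bytes in each of the three forms.
assert len(chr(0x1D11E).encode("utf-8")) == 4
assert len(chr(0x1D11E).encode("utf-32-be")) == 4
```

The pair case carries 2 x 10 bits on top of the base 0x10000, which is
where the limit of U+10FFFF and the "about a million" extra characters
come from.
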
The ISO standard is going to match this and limit itself to U+10FFFF
too, which in theory should end the problem about the number of
characters in these standards.

Unicode had to do something about this because it finally began adding
characters above U+FFFF, and it would really make no sense to treat
UTF-16 as the fundamental view, saying that some codes don't really
represent characters but must be used in pairs: character properties are
defined in terms of real characters, not individual components of
surrogate pairs. Surrogates are just a hole in the middle of the first
64k of characters, because UTF-16 can't encode them in isolation.

Unfortunately the 16-bit view is still widespread and there is much
confusion. Companies invested money in 16-bit Unicode and can't simply
replace it with something entirely different, so they are actually
beginning to implement UTF-16. In the past, support for surrogates could
be almost non-existent in practice, but now there are actual characters
allocated there, so it must be done, despite the pain of using a
variable-length encoding.

There are cases like Oracle, which ignored surrogates and misimplemented
UTF-8 by treating surrogates like other characters below U+FFFF, yet
called it UTF-8. Now, instead of fixing their mistake, they have added
the real UTF-8 under a strange name, AL24UTFFSS (I'm not sure if they
finally fixed the names), and are trying to push their old version as an
official alternative to UTF-8. There is very strong opposition, but they
are still trying.

IMHO it would have been better not to invent UTF-16 at all and to use
UTF-8 in parallel with UTF-32. But Unicode used to promote UTF-16 as the
real Unicode, and now it takes many threads on the Unicode list to clear
up the confusion about the nature of characters above U+FFFF.
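
The difference between real UTF-8 and a "UTF-8" that treats surrogates
like ordinary characters below U+FFFF can be sketched in a few lines of
Python. This is only an illustration under the assumption that the
misimplemented variant encodes each half of a surrogate pair as its own
3-byte sequence; the function names are hypothetical and nothing here is
taken from any vendor's code.

```python
def utf8(code_point):
    """Real UTF-8: one sequence per character (up to 4 bytes)."""
    return chr(code_point).encode("utf-8")


def surrogate_wise_utf8(code_point):
    """Assumed behaviour of the misimplemented variant: go through
    UTF-16 first, then encode each 16-bit unit separately."""
    if code_point <= 0xFFFF:
        return utf8(code_point)
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    # "surrogatepass" lets Python emit the 3-byte form of a surrogate,
    # which a strict UTF-8 encoder refuses to produce.
    return b"".join(
        chr(unit).encode("utf-8", "surrogatepass") for unit in (high, low)
    )


real = utf8(0x1D11E)                  # 4 bytes: f0 9d 84 9e
fake = surrogate_wise_utf8(0x1D11E)   # 6 bytes: ed a0 b4 ed b4 9e
assert len(real) == 4 and len(fake) == 6

# A strict UTF-8 decoder rejects the 6-byte form.
try:
    fake.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8:", fake.hex())
```

For characters up to U+FFFF the two agree byte for byte, which is why
the mistake goes unnoticed until characters above U+FFFF are actually
in use.
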
-- 
 __("<  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^                      SUBSTITUTE SIGNATURE
QRCZAK

_______________________________________________
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell