On 17 May 2011 16:03, Boris Zbarsky <[email protected]> wrote:

> On 5/17/11 3:29 PM, Wes Garland wrote:
>
>> The problem is that UTF-16 cannot represent all possible code points.
>
> My point is that neither can UTF-8. Can you name an encoding that _can_
> represent the surrogate-range codepoints?
UTF-8 and UTF-32. I think UTF-7 can, too, but it is not a standard, so it
is not really worth discussing. UTF-16 is the odd one out.

> Therefore I stand by my statement: if you allow what to me looks like
> arrays of "UTF-32 code units and also values that fall into the
> surrogate ranges" then you don't get Unicode strings. You get a set of
> arrays that contains Unicode strings as a proper subset.

Okay, I think we have to agree to disagree here. I believe my reading of
the spec is correct.

> There are no such valid UTF-8 strings; see spec quotes above. The
> proposal would have involved having invalid pseudo-UTF-ish strings.

Yes, you can encode code points d800 - dfff in UTF-8 strings. These are
not *well-formed* strings, but they are Unicode 8-bit strings (D81)
nonetheless. (There is a sketch of such an encoder at the end of this
message.)

What you can't do is encode 16-bit code units in UTF-8 strings, because
you can only convert from one encoding to another via code points; code
units have no cross-encoding meaning.

Further, you can't encode code points d800 - dfff in UTF-16 strings,
leaving you at a loss when you want to store those values in JS Strings
(i.e. when using them as uint16[]) except by generating ill-formed UTF-16.

I believe it would be far better to treat those values as Unicode code
points, not 16-bit code units, and to allow JS String elements to express
the whole 21-bit code point range afforded by Unicode.

In other words, the current misuse of JS Strings, which can store
"characters" 0 - ffff in ill-formed UTF-16 strings, would become use of JS
Strings to store code points 0 - 1fffff. Those code points may include the
reserved surrogate code points d800 - dfff, which cannot be represented in
UTF-16 but CAN be represented, without loss, in UTF-8, UTF-32, and
proposed-new-JS-Strings.

>> If JS Strings were arrays of Unicode code points, this conversion would
>> be a non-issue; UTF-8 sequence 0xed 0xb0 0x88 becomes Unicode code
>> point 0xdc08, with no incorrect conversion taking place.
>
> Sorry, no. See above.

# printf '\xed\xb0\x88' | iconv -f UTF-8 -t UCS-4BE | od -x
0000000 0000 dc08
0000004
# printf '\000\000\xdc\x08' | iconv -f UCS-4BE -t UTF-8 | od -x
0000000 edb0 8800
0000003

> I just don't get it. You can stick the invalid 16-bit value 0xdc08 into
> a "UTF-16" string just as easily as you can stick the invalid 24-bit
> sequence 0xed 0xb0 0x88 into a "UTF-8" string. Can you please, please
> tell me what made you decide there's _any_ difference between the two
> cases? They're equally invalid in _exactly_ the same way.

The difference is that in UTF-8, 0xed 0xb0 0x88 means "the Unicode code
point 0xdc08", while in UTF-16, 0xdc08 means only "part of some non-BMP
code point".

Said another way, 0xed in UTF-8 has nearly the same meaning as 0xdc08 in
UTF-16: both are ill-formed code unit subsequences (D84a), and neither, by
itself, represents any code point.
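To make the D81 point concrete, here is a minimal sketch of the kind of
encoder I have in mind -- illustrative JS only; the function name is mine
and nothing about it comes from any spec. It applies the ordinary UTF-8
bit layout to any code point 0 - 1fffff, deliberately without rejecting
the surrogate range:

function encodeCodePointAsUTF8Bytes(cp) {
    // Ordinary UTF-8 bit layout; there is intentionally no check for
    // d800 - dfff, so the result can be a Unicode 8-bit string (D81)
    // that is not well-formed.
    if (cp < 0x80)
        return [cp];
    if (cp < 0x800)
        return [0xC0 | (cp >> 6),
                0x80 | (cp & 0x3F)];
    if (cp < 0x10000)
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)];
    return [0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)];
}

encodeCodePointAsUTF8Bytes(0xDC08); // [0xED, 0xB0, 0x88] -- the same
                                    // bytes iconv produced above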
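The uint16[] misuse is just as easy to demonstrate in any engine today;
nothing stops you from building a string whose only element is a lone
trail surrogate:

var s = String.fromCharCode(0xDC08); // ill-formed UTF-16: a lone trail surrogate
s.length;                            // 1
s.charCodeAt(0).toString(16);        // "dc08"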
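And the asymmetry itself, sketched as arithmetic (again purely
illustrative): running the ordinary 3-byte UTF-8 decode over ed b0 88
yields exactly one code point, while the lone UTF-16 code unit dc08 yields
none:

// 3-byte UTF-8 decode, no validity check:
var cp = ((0xED & 0x0F) << 12) | ((0xB0 & 0x3F) << 6) | (0x88 & 0x3F);
cp.toString(16); // "dc08" -- one definite code point

// The same value as a UTF-16 code unit, standing alone, names nothing:
var unit = 0xDC08;                       // a trail surrogate with no lead
var isIllFormed = unit >= 0xDC00 && unit <= 0xDFFF;
// isIllFormed === true: an ill-formed code unit subsequence (D84a),
// not a code point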
Wes

-- 
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102

_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss

