On 5/17/11 5:24 PM, Wes Garland wrote:
> UTF-8 and UTF-32. I think UTF-7 can, too, but it is not a standard so
> it's not really worth discussing. UTF-16 is the odd one out.
>> That's not what the spec says.
> Okay, I think we have to agree to disagree here. I believe my reading of
> the spec is correct.
Sorry, but no... how much more clear can the spec get?
>> There are no such valid UTF-8 strings; see spec quotes above. The
>> proposal would have involved having invalid pseudo-UTF-ish strings.
> Yes, you can encode code points d800 - dfff in UTF-8 Strings. These are
> not /well-formed/ strings, but they are Unicode 8-bit Strings (D81)
> nonetheless.
The spec seems to pretty clearly define UTF-8 strings as things that do
NOT contain the encoding of those code points. If you think otherwise,
cite please.
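[Editor's aside, not part of the original exchange: Python 3's codecs happen to implement the conformance rule being cited here, so the distinction is easy to demonstrate. A small sketch, assuming only the standard codecs and their `surrogatepass` error handler:]

```python
# A conforming UTF-8 encoder must reject a lone surrogate such as
# U+DC08: it is a code point, but not a Unicode scalar value.
try:
    "\udc08".encode("utf-8")
    raise AssertionError("expected UnicodeEncodeError")
except UnicodeEncodeError as e:
    print("strict encoder rejects U+DC08:", e.reason)

# Only by explicitly opting out of conformance ("surrogatepass") do
# you get the ill-formed byte sequence ED B0 88 that iconv produced.
ill_formed = "\udc08".encode("utf-8", "surrogatepass")
print(ill_formed)  # b'\xed\xb0\x88'
```

[That is, the bytes exist, but no conforming encoder will emit them.]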
> Further, you can't encode code points d800 - dfff in UTF-16 Strings,
Where does the spec say this? And why does that part of the spec not
apply to UTF-8?
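[Editor's aside: a conforming UTF-16 encoder behaves the same way for a lone surrogate, which is easy to check with the same Python codecs as above — a sketch, not part of the original thread:]

```python
# UTF-16 has the same restriction as UTF-8: a lone surrogate such as
# U+DC08 is not a Unicode scalar value, so a conforming encoder rejects it.
try:
    "\udc08".encode("utf-16-be")
    raise AssertionError("expected UnicodeEncodeError")
except UnicodeEncodeError:
    print("UTF-16 encoder also rejects lone surrogate U+DC08")

# The non-conforming escape hatch exists here too; it emits the raw
# code unit 0xDC08, indistinguishable from an unpaired trail surrogate.
print("\udc08".encode("utf-16-be", "surrogatepass"))  # b'\xdc\x08'
```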
> # printf '\xed\xb0\x88' | iconv -f UTF-8 -t UCS-4BE | od -x
> 0000000 0000 dc08
> 0000004
> # printf '\000\000\xdc\x08' | iconv -f UCS-4BE -t UTF-8 | od -x
> 0000000 edb0 8800
> 0000003
As far as I can tell, that second conversion is just an implementation
bug per the spec. See the part I quoted which explicitly says that an
encoder in that situation must stop and return an error.
> The difference is that in UTF-8, 0xed 0xb0 0x88 means "The Unicode code
> point 0xdc08"
According to the spec you were citing, that code unit sequence means a
UTF-8 decoder should error, no?
-Boris
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss