On 5/17/11 5:24 PM, Wes Garland wrote:
> UTF-8 and UTF-32.  I think UTF-7 can, too, but it is not a standard so
> it's not really worth discussing.  UTF-16 is the odd one out.

That's not what the spec says.

> Okay, I think we have to agree to disagree here. I believe my reading of
> the spec is correct.

Sorry, but no...  how much more clear can the spec get?

>> There are no such valid UTF-8 strings; see spec quotes above.  The
>> proposal would have involved having invalid pseudo-UTF-ish strings.


> Yes, you can encode code points d800 - dfff in UTF-8 Strings.  These are
> not /well-formed/ strings, but they are Unicode 8-bit Strings (D81)
> nonetheless.

The spec seems to pretty clearly define UTF-8 strings as things that
do NOT contain the encoding of those code points. If you think
otherwise, please cite the relevant spec text.
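
For what it's worth, here is what that distinction looks like in one
implementation. Python 3's strict UTF-8 codec will not produce those
bytes at all; you only get them through its deliberately
non-conformant 'surrogatepass' error handler (an illustration of one
implementation's reading, not spec text):

    >>> '\udc08'.encode('utf-8', 'surrogatepass')  # ill-formed on purpose
    b'\xed\xb0\x88'
    >>> '\udc08'.encode('utf-8')  # the strict encoder refuses
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc08'
    in position 0: surrogates not allowed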

> Further, you can't encode code points d800 - dfff in UTF-16 Strings,

Where does the spec say this? And why does that part of the spec not apply to UTF-8?
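
(Illustration again, assuming Python 3's codecs track the spec here:
its UTF-16 encoder rejects a lone surrogate in exactly the same way
its UTF-8 encoder does.)

    >>> '\udc08'.encode('utf-16-be')  # same rejection as for utf-8
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'utf-16-be' codec can't encode character
    '\udc08' in position 0: surrogates not allowed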

> # printf '\xed\xb0\x88' | iconv -f UTF-8 -t UCS-4BE | od -x
> 0000000 0000 dc08
> 0000004
> # printf '\000\000\xdc\x08' | iconv -f UCS-4BE -t UTF-8 | od -x
> 0000000 edb0 8800
> 0000003

As far as I can tell, that second conversion is just an implementation bug per the spec. See the part I quoted which explicitly says that an encoder in that situation must stop and return an error.
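
For what it's worth, a strict implementation stops exactly where that
language says it must. The input half of that second conversion, run
through Python 3's codecs instead (again an illustration of one
implementation, not spec text):

    >>> b'\x00\x00\xdc\x08'.decode('utf-32-be')  # read U+DC08 back in
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-32-be' codec can't decode bytes in
    position 0-3: code point in surrogate code point range(0xd800, 0xe000)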

> The difference is that in UTF-8, 0xed 0xb0 0x88 means "The Unicode code
> point 0xdc08"

According to the spec you were citing, that code unit sequence means a UTF-8 decoder should error, no?
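
(Same illustration on the decode side: Python 3's strict UTF-8
decoder treats 0xed 0xb0 0x88 as an error, not as U+DC08, and only
the non-conformant 'surrogatepass' handler reads it the way you
describe.)

    >>> b'\xed\xb0\x88'.decode('utf-8')  # strict decoder errors
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in
    position 0: invalid continuation byte
    >>> b'\xed\xb0\x88'.decode('utf-8', 'surrogatepass')  # escape hatch
    '\udc08'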

-Boris