On 5/17/11 5:24 PM, Wes Garland wrote:
> UTF-8 and UTF-32. I think UTF-7 can, too, but it is not a standard so
> it's not really worth discussing. UTF-16 is the odd one out.
>> That's not what the spec says.
> Okay, I think we have to agree to disagree here. I believe my reading of
> the spec is correct.
Sorry, but no... how much more clear can the spec get?
>> There are no such valid UTF-8 strings; see spec quotes above. The
>> proposal would have involved having invalid pseudo-UTF-ish strings.
> Yes, you can encode code points d800 - dfff in UTF-8 Strings. These are
> not /well-formed/ strings, but they are Unicode 8-bit Strings (D81)
> nonetheless.
The spec seems to pretty clearly define UTF-8 strings as things that do
NOT contain the encoding of those code points. If you think otherwise,
cite please.
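[Editor's aside, not part of the original exchange: Python 3's codecs happen to implement the conformance rule being cited here, so the distinction is easy to demonstrate. A small sketch, assuming only the standard codecs and their `surrogatepass` error handler:]

```python
# A conforming UTF-8 encoder must reject a lone surrogate such as
# U+DC08: it is a code point, but not a Unicode scalar value.
try:
    "\udc08".encode("utf-8")
    raise AssertionError("expected UnicodeEncodeError")
except UnicodeEncodeError as e:
    print("strict encoder rejects U+DC08:", e.reason)

# Only by explicitly opting out of conformance ("surrogatepass") do
# you get the ill-formed byte sequence ED B0 88 that iconv produced.
ill_formed = "\udc08".encode("utf-8", "surrogatepass")
print(ill_formed)  # b'\xed\xb0\x88'
```

[That is, the bytes exist, but no conforming encoder will emit them.]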
> Further, you can't encode code points d800 - dfff in UTF-16 Strings,
Where does the spec say this? And why does that part of the spec not
apply to UTF-8?
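[Editor's aside: a conforming UTF-16 encoder behaves the same way for a lone surrogate, which is easy to check with the same Python codecs as above — a sketch, not part of the original thread:]

```python
# UTF-16 has the same restriction as UTF-8: a lone surrogate such as
# U+DC08 is not a Unicode scalar value, so a conforming encoder rejects it.
try:
    "\udc08".encode("utf-16-be")
    raise AssertionError("expected UnicodeEncodeError")
except UnicodeEncodeError:
    print("UTF-16 encoder also rejects lone surrogate U+DC08")

# The non-conforming escape hatch exists here too; it emits the raw
# code unit 0xDC08, indistinguishable from an unpaired trail surrogate.
print("\udc08".encode("utf-16-be", "surrogatepass"))  # b'\xdc\x08'
```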
> # printf '\xed\xb0\x88' | iconv -f UTF-8 -t UCS-4BE | od -x
> 0000000 0000 dc08
> 0000004
> # printf '\000\000\xdc\x08' | iconv -f UCS-4BE -t UTF-8 | od -x
> 0000000 edb0 8800
> 0000003
As far as I can tell, that second conversion is just an implementation
bug per the spec. See the part I quoted which explicitly says that an
encoder in that situation must stop and return an error.
> The difference is that in UTF-8, 0xed 0xb0 0x88 means "The Unicode code
> point 0xdc08"
According to the spec you were citing, that code unit sequence means a
UTF-8 decoder should error, no?
-Boris
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss