On 5/16/11 4:37 PM, Mike Samuel wrote:
> You might have. If you reject my assertion about option 2 above, then
> to clarify,
>
> The UTF-16 representation of codepoint U+10000 is the code-unit pair
> U+D8000 U+DC000.
No. The UTF-16 representation of codepoint U+10000 is the code-unit
pair 0xD800 0xDC00. These are 16-bit unsigned integers, NOT Unicode
characters (which is what the U+NNNNN notation means).
> The UTF-16 representation of codepoint U+D8000 is the single code-unit
> U+D8000 and similarly for U+DC00.
I'm assuming you meant U+D800 in the first two code-units there.
There is no Unicode character U+D800 or U+DC00; those code points are
permanently reserved as surrogates. See
http://www.unicode.org/charts/PDF/UD800.pdf and
http://www.unicode.org/charts/PDF/UDC00.pdf which clearly say that no
Unicode characters are assigned to those codepoints.
> How can the codepoints U+D800 U+DC00 be distinguished in a DOMString
> implementation that uses UTF-16 under the hood from the codepoint
> U+10000?
They don't have to be; if 0xD800 0xDC00 are present (in that order) then
they encode U+10000. If they're present on their own, it's not a valid
UTF-16 string, hence not a valid DOMString and some sort of
error-handling behavior (which presumably needs defining) needs to take
place.
That said, defining JS strings and DOMString differently seems like a
recipe for serious author confusion (e.g. actually using JS strings as
the DOMString binding in ES might be lossy, assigning from JS strings to
DOMString might be lossy, etc). It's a minefield.
-Boris
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss