On 5/16/11 4:37 PM, Mike Samuel wrote:
> You might have. If you reject my assertion about option 2 above, then
> to clarify,
>
> The UTF-16 representation of codepoint U+10000 is the code-unit pair
> U+D8000 U+DC000.
No. The UTF-16 representation of codepoint U+10000 is the code-unit
pair 0xD800 0xDC00. These are 16-bit unsigned integers, NOT Unicode
characters (which is what the U+NNNNN notation means).
> The UTF-16 representation of codepoint U+D8000 is the single code-unit
> U+D8000 and similarly for U+DC00.
I'm assuming you meant U+D800 in the first two code-units there.
There is no Unicode character U+D800 or U+DC00; those code points are
permanently reserved as surrogates. See
http://www.unicode.org/charts/PDF/UD800.pdf and
http://www.unicode.org/charts/PDF/UDC00.pdf which clearly say that no
Unicode characters are assigned to those codepoints.
> How can the codepoints U+D800 U+DC00 be distinguished in a DOMString
> implementation that uses UTF-16 under the hood from the codepoint
> U+10000?
They don't have to be; if 0xD800 0xDC00 are present (in that order) then
they encode U+10000. If they're present on their own, it's not a valid
UTF-16 string, hence not a valid DOMString and some sort of
error-handling behavior (which presumably needs defining) needs to take
place.
That said, defining JS strings and DOMString differently seems like a
recipe for serious author confusion (e.g. actually using JS strings as
the DOMString binding in ES might be lossy, assigning from JS strings to
DOMString might be lossy, etc). It's a minefield.
-Boris
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss