On 5/16/11 5:23 PM, Shawn Steele wrote:
I’m having some (ok, a great deal of) confusion between the DOM Encoding
and the JavaScript encoding and whatever. I’d assumed that if I had a
web page in some encoding, that it was converted to UTF-16 (well,
UCS-2), and that’s what the JavaScript engine did its work on.

JS strings are currently defined as arrays of 16-bit unsigned integers. I believe the intent at the time was that these could represent actual Unicode strings encoded as UCS-2, but they can also represent arbitrary arrays of 16-bit unsigned integers.

The DOM just uses JS strings for DOMString and defines DOMString to be UTF-16. That's not quite compatible with UCS-2, but....

JS strings can contain integers that correspond to UTF-16 surrogates. There are no constraints on what comes before/after them in JS strings.
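To make that concrete, here is a quick sketch (variable names are mine) showing that lone and out-of-order surrogates are legal JS string values, even though neither is well-formed UTF-16:

```javascript
// A lone high surrogate is a perfectly legal JS string value...
const lone = "\uD800";
console.log(lone.length); // 1

// ...and so is a "backwards" pair: low surrogate before high.
const reversed = "\uDC00\uD800";
console.log(reversed.length); // 2

// Neither is well-formed UTF-16, but the language doesn't care:
// a JS string is just a sequence of 16-bit code units.
```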

In UTF-8, individually encoded surrogates are illegal (and a security
risk). Eg: you shouldn’t be able to encode D800/DC00 as two 3-byte
sequences; they should be a single 6-byte sequence

A single 4 byte sequence, actually, last I checked.
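One way to see this from JS itself (my illustration, not from the thread) is `encodeURIComponent`, which percent-encodes the UTF-8 bytes of its input. The pair D800/DC00 denotes U+10000, whose UTF-8 form is the single 4-byte sequence F0 90 80 80; a lone surrogate has no legal UTF-8 encoding at all, and the function throws rather than produce one:

```javascript
// The surrogate pair for U+10000 encodes as one 4-byte UTF-8 sequence.
const paired = encodeURIComponent("\uD800\uDC00");
console.log(paired); // "%F0%90%80%80"

// A lone surrogate has no legal UTF-8 encoding;
// encodeURIComponent throws a URIError instead of emitting one.
let threwURIError = false;
try {
  encodeURIComponent("\uD800");
} catch (e) {
  threwURIError = e instanceof URIError;
}
console.log(threwURIError); // true
```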

Having not played
with the js encoding/decoding in quite some time, I’m not sure what they
do in that case, but hopefully it isn’t illegal UTF-8.

I'm not sure which "they" and under what conditions we're considering here.

(You also
shouldn’t be able to have half a surrogate pair in UTF-16, but many
things are pretty lax about that.)

Verily.

-Boris
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
