On 5/16/11 5:23 PM, Shawn Steele wrote:
I’m having some (ok, a great deal of) confusion between the DOM encoding and the JavaScript encoding and whatever. I’d assumed that if I had a web page in some encoding, it was converted to UTF-16 (well, UCS-2), and that’s what the JavaScript engine did its work on.
JS strings are currently defined as arrays of 16-bit unsigned integers. I believe the intent at the time was that these could represent actual Unicode strings encoded as UCS-2, but they can also represent arbitrary arrays of 16-bit unsigned integers.
The DOM just uses JS strings for DOMString and defines DOMString to be UTF-16. That's not quite compatible with UCS-2, but....
JS strings can contain integers that correspond to UTF-16 surrogates. There are no constraints in what comes before/after them in JS strings.
In UTF-8, individually encoded surrogates are illegal (and a security risk). Eg: you shouldn’t be able to encode D800/DC00 as two 3 byte sequences, they should be a single 6 byte sequence
A single 4 byte sequence, actually, last I checked.
Having not played with the js encoding/decoding in quite some time, I’m not sure what they do in that case, but hopefully it isn’t illegal UTF-8.
I'm not sure which "they" and under what conditions we're considering here.
(You also shouldn’t be able to have half a surrogate pair in UTF-16, but many things are pretty lax about that.)
Verily.

-Boris
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss

