That is incorrect. See below.

Mark
*— Il meglio è l’inimico del bene —*

On Tue, May 17, 2011 at 18:33, Wes Garland <[email protected]> wrote:

> On 17 May 2011 20:09, Boris Zbarsky <[email protected]> wrote:
>
>> On 5/17/11 5:24 PM, Wes Garland wrote:
>>
>>> Okay, I think we have to agree to disagree here. I believe my reading of
>>> the spec is correct.
>>>
>>
>> Sorry, but no... how much more clear can the spec get?
>>
>>
> In the past, I have read it thus, pseudo BNF:
>
> UnicodeString => CodeUnitSequence // D80
> CodeUnitSequence => CodeUnit | CodeUnitSequence CodeUnit // D78
> CodeUnit => <anything in the current encoding form> // D77
>

So far, so good. In particular, d800 is a code unit for UTF-16, since it is
a code unit that can occur in some code unit sequence in UTF-16.

> Upon careful re-reading of this part of the specification, I see that D79
> is also important. It says that "A Unicode encoding form assigns each
> Unicode scalar value to a unique code unit sequence.",
>

True.

> and further clarifies that "The mapping of the set of Unicode scalar values
> to the set of code unit sequences for a Unicode encoding form is
> one-to-one."
>

True. This is all consistent with saying that UTF-16 can't contain an
isolated d800.

*However, that only shows that a Unicode 16-bit string (D82) is not the same
as a UTF-16 string (D89), which has been pointed out previously.*

Repeating the note under D89:

A Unicode string consisting of a well-formed UTF-16 code unit sequence is
said to be *in UTF-16*. Such a Unicode string is referred to as a *valid
UTF-16 string*, or a *UTF-16 string* for short.

That is, every UTF-16 string is a Unicode 16-bit string, but *not* vice
versa.

Examples:

- "\u0061\ud800\udc00" is both a Unicode 16-bit string and a UTF-16 string.
- "\u0061\ud800\udbff" is a Unicode 16-bit string, but not a UTF-16 string,
  because d800 is not followed by a low surrogate.

> This means that your original assertion -- that Unicode strings cannot
> contain the high surrogate code points, regardless of meaning -- is in fact
> correct.
>

That is incorrect.

>
> Which is unfortunate, as it means that we either
>
> 1. Allow non-Unicode strings in JS -- i.e. Strings composed of all
>    values in the set [0x0, 0x1FFFFF]
> 2. Keep making programmers pay the raw-UTF-16 representation tax
> 3. Break the String-as-uint16 pattern
>
> I still believe that #1 is the way forward, and that problem of
> round-tripping these values through the DOM is solvable.
>
> Wes
>
> --
> Wesley W. Garland
> Director, Product Development
> PageMail, Inc.
> +1 613 542 2787 x 102
>
> _______________________________________________
> es-discuss mailing list
> [email protected]
> https://mail.mozilla.org/listinfo/es-discuss
>
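
To make the distinction between a Unicode 16-bit string (D82) and a UTF-16 string (D89) concrete, here is a minimal sketch in JavaScript. It assumes that any JS string can be treated as a sequence of 16-bit code units, so the only question is whether the surrogates are properly paired. The name `isUTF16String` is hypothetical and chosen for illustration only; it is not taken from the Unicode Standard or from any library.

```javascript
// Hypothetical helper (illustration only): returns true when the string's
// code units form a well-formed UTF-16 sequence, i.e. every high surrogate
// is immediately followed by a low surrogate and no low surrogate stands alone.
function isUTF16String(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: the next code unit must be a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xdc00 || next > 0xdfff) return false;
      i++; // skip the low surrogate that completes the pair
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      return false;
    }
  }
  return true;
}

// A properly paired surrogate: well-formed UTF-16.
console.log(isUTF16String("\u0061\ud800\udc00")); // true
// An isolated d800: still a sequence of 16-bit code units, but not UTF-16.
console.log(isUTF16String("\u0061\ud800")); // false
```

Both inputs are accepted as string values; only the first passes the stricter well-formedness check, which is the "not vice versa" direction described above.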

