On 5/17/11 2:12 PM, Wes Garland wrote:
That said, you can encode these code points with utf-8; for example,
0xdc08 becomes 0xed 0xb0 0x88.

By the same argument, you can encode them in UTF-16. The byte sequence above is not valid UTF-8. See "How do I convert an unpaired UTF-16 surrogate to UTF-8?" at http://unicode.org/faq/utf_bom.html which says:

  A different issue arises if an unpaired surrogate is encountered
  when converting ill-formed UTF-16 data. By representing such an
  unpaired surrogate on its own as a 3-byte sequence, the resulting
  UTF-8 data stream would become ill-formed. While it faithfully
  reflects the nature of the input, Unicode conformance requires that
  encoding form conversion always results in a valid data stream.
  Therefore a converter must treat this as an error.

(fwiw, this is the third hit on Google for "utf-8 surrogates" right after the Wikipedia articles on UTF-8 and UTF-16, so it's not like it's hard to find this information).

    No, you're allowing storage of some sort of number arrays that don't
    represent Unicode strings at all.

No, if I understand Allen's proposal correctly, we're allowing storage
of some sort of number arrays that may contain reserved code points,
some of which cannot be represented in UTF-16.

See above. You're allowing number arrays that may or may not be interpretable as Unicode strings, period.

This isn't that different from the status quo; it is possible right now
to generate JS Strings which are not valid UTF-16 by creating invalid
surrogate pairs.

True. However right now no one is pretending that strings are anything other than arrays of 16-bit units.

Keep in mind, also, that even a sequence of random bytes is a valid
Unicode string. The standard does not require that they be well-formed.
(D80)

Uh... A sequence of _bytes_ is not anything related to Unicode unless you know how it's encoded.

Not sure what "(D80)" is supposed to mean.

    Right, so if it's looking for non-BMP characters in the string, say,
    instead of computing the length, it won't find them.  How the heck
    is that "just works"?

My untested hypothesis is that the vast majority of JS code looking for
non-BMP characters is looking for them in order to call them out for
special processing, because the code unit and code point size are
different.  When they don't need special processing, they don't need to
be found.

This hypothesis is worth testing before being blindly inflicted on the web.
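The kind of "special processing" in question typically looks something like this hypothetical sketch -- counting code points by detecting surrogate pairs, precisely because `.length` counts 16-bit code units:

```javascript
// Hypothetical sketch of the pattern under discussion: code that scans
// for non-BMP characters (surrogate pairs) in order to treat them as a
// single code point.
function codePointLength(s) {
  let count = 0;
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    // A high surrogate followed by a low surrogate is one code point.
    if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length) {
      const d = s.charCodeAt(i + 1);
      if (d >= 0xDC00 && d <= 0xDFFF) i++;
    }
    count++;
  }
  return count;
}

console.log("a\u{1D11E}b".length);           // 4 code units
console.log(codePointLength("a\u{1D11E}b")); // 3 code points
```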

    What would that even mean?  DOMString is defined to be an ES string
    in the ES binding right now.  Is the proposal to have some other
    kind of object for DOMString (so that, for example, String.prototype
    would no longer affect the behavior of DOMString the way it does now)?

Wait, are DOMStrings formally UTF-16, or are they ES Strings?

DOMStrings are formally UTF-16 in the DOM spec.

They are defined to be ES strings in the ES binding for the DOM.

Please be careful to not confuse the DOM and its language bindings.

One could change the ES binding to use a non-ES-string object to preserve the DOM's requirement that strings be sequences of UTF-16 code units. I'd expect this would break the web unless one is really careful doing it...

    How is that different from sticking non-UTF-16 into an ES string
    right now?

Currently, JS Strings are effectively arrays of 16-bit code units, which
are indistinguishable from 16-bit Unicode strings.

Yes.

(D82)

?

This means that a JS application can use JS Strings as arrays of uint16,
and expect to be able to round-trip all strings, even those which are
not well-formed, through a UTF-16 DOM.

Yep.  And they do.

If we redefine JS Strings to be arrays of Unicode code points, then the
JS application can use JS Strings as arrays of uint21 -- but round-tripping
the high-surrogate code points through a UTF-16 layer would not work.

OK, that seems like a breaking change.

        It might mean extra copying, or it might not if the DOM
        implementation already uses
        UTF-8 internally.

    Uh... what does UTF-8 have to do with this?

If you're already storing UTF-8 strings internally, then you are already
doing something "expensive" (like copying) to get their code units into
and out of JS.

Maybe, and maybe not. We (Mozilla) have had some proposals to actually use UTF-8 throughout, including in the JS engine; it's quite possible to implement an API that looks like a 16-bit array on top of UTF-8 as long as you allow invalid UTF-8 that's needed to represent surrogates and the like.
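A sketch of the per-code-unit encoding such a scheme needs -- the "invalid UTF-8" in question, i.e. CESU-8-style three-byte sequences for surrogates, matching the 0xdc08 example earlier in the thread:

```javascript
// Sketch: encode one 16-bit code unit using the generalized UTF-8 bit
// layout, *including* surrogates -- which is exactly what makes the
// output ill-formed UTF-8, but lets a UTF-8 store present a lossless
// 16-bit-array view.
function encodeCodeUnit(u) {
  if (u < 0x80)  return [u];
  if (u < 0x800) return [0xC0 | (u >> 6), 0x80 | (u & 0x3F)];
  return [0xE0 | (u >> 12),
          0x80 | ((u >> 6) & 0x3F),
          0x80 | (u & 0x3F)];
}

console.log(encodeCodeUnit(0xDC08).map(b => b.toString(16))); // ['ed', 'b0', '88']
```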

    (As a note, Gecko and WebKit both use UTF-16 internally; I would be
    _really_ surprised if Trident does not.  No idea about Presto.)

FWIW - last time I scanned the v8 sources, it appeared to use a
three-representation class, which could store either ASCII, UCS2, or
UTF-8.  Presumably ASCII could also be ISO-Latin-1, as both are exact,
naive, byte-sized UCS2/UTF-16 subsets.

There's a difference between internal representation and what things look like. For example, Gecko stores DOM text nodes as either ASCII or UTF-16 in practice, but always makes them look like UTF-16 to non-internal consumers....

There's also a possible difference, as you just noted, between what the ES implementation uses and what the DOM uses; certainly in the WebKit+V8 case, but also in the Gecko+Spidermonkey case when textnodes are involved, etc.

I was talking about what the DOM implementations do, not the ES implementations.

-Boris
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
