On 16 May 2011 17:42, Boris Zbarsky <[email protected]> wrote:
> On 5/16/11 4:38 PM, Wes Garland wrote:
>
>> Two great things about strings composed of Unicode code points:
>>
> ...
>
>> Even though this is a breaking change from ES-5, I support it
>> whole-heartedly.... but I expect breakage to be very limited. Provided
>> that the implementation does not restrict the storage of reserved code
>> points (D800-DF00)
>>
>
> Those aren't code points at all. They're just not Unicode.
>
Not quite: code points D800-DFFF are reserved code points which are not
representable with UTF-16. Definition D71, Unicode 6.0.
> If you allow storage of such, then you're allowing mixing Unicode strings
> and "something else" (whatever the something else is), with most likely
> bad results.
>
I don't believe this is true. We are merely allowing storage of Unicode
strings which cannot be converted into UTF-16. That allows us to maintain
most of the existing String behaviour (arbitrary array of uint16), although
overflowing like this would break:
a = String.fromCharCode(str.charCodeAt(0) + 1)
when str[0] is U+FFFF.
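
For reference, a minimal sketch of how that expression behaves under the existing ES5 semantics, where String.fromCharCode truncates its argument to 16 bits. The variable names are only illustrative:

```javascript
// ES5 behaviour: String.fromCharCode applies ToUint16 to its argument,
// so incrementing past 0xFFFF silently wraps around to 0.
var str = "\uFFFF";
var next = String.fromCharCode(str.charCodeAt(0) + 1);

// The result is a one-unit string holding 0x0000, not code point 0x10000.
var wrapped = next.length === 1 && next.charCodeAt(0) === 0;
```

Code that relies on this wrap-around is the sort of thing that would change meaning if string elements became full code points.
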
> Most simply, assigning a DOMString containing surrogates to a JS string
> should collapse the surrogate pairs into the corresponding codepoint if JS
> strings really contain codepoints...
>
> The only way to make this work is if either DOMString is redefined or
> DOMString and full Unicode strings are different kinds of objects.
>
>
>> Users doing surrogate pair decomposition will probably find that their
>> code "just works"
>>
>
> How, exactly?
>
/** Untested and not rigorous */
function unicode_strlen(validUnicodeString) {
  var length = 0;
  for (var i = 0; i < validUnicodeString.length; i++) {
    var unit = validUnicodeString.charCodeAt(i);
    // Skip lead (high) surrogates, 0xD800-0xDBFF, so each pair counts once.
    if (unit >= 0xd800 && unit <= 0xdbff)
      continue;
    length++;
  }
  return length;
}
Code like this ^^^^ which looks for surrogate pairs in valid Unicode strings
will simply not find them, instead only finding code points which seem to
be the same size as the code unit.
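
To make that concrete, here is a small worked sketch using today's UTF-16 strings. The helper name countCodePoints and the sample strings are my own illustration, not part of any proposal. On a string with no surrogates, which is what every valid string would look like to this loop under the proposal, the count simply equals .length:

```javascript
// Counts code points in a UTF-16 string by skipping the lead (high)
// surrogate of each pair, so that a pair contributes one to the total.
function countCodePoints(utf16String) {
  var count = 0;
  for (var i = 0; i < utf16String.length; i++) {
    var unit = utf16String.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff)
      continue; // lead surrogate: the trail unit that follows is counted
    count++;
  }
  return count;
}

// U+1D49C (MATHEMATICAL SCRIPT CAPITAL A) written as a surrogate pair.
var withPair = "a\uD835\uDC9Cb";
var pairUnits = withPair.length;              // 4 code units
var pairPoints = countCodePoints(withPair);   // 3 code points

// No surrogates present: the surrogate branch is never taken.
var bmpOnly = "abc\u0915";
var bmpPoints = countCodePoints(bmpOnly);     // same as bmpOnly.length
```
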
>
>> Users creating Strings with surrogate pairs will need to
>> re-tool
>>
>
> Such users would include the DOM, right?
>
I am hopeful that most web browsers have one or few interfaces between DOM
strings and JS strings. I do not know if my hopes reflect reality.
>> but this is a small burden and these users will be at the upper
>> strata of Unicode-foodom.
>>
>
> You're talking every single web developer here. Or at least every single
> web developer who wants to work with Devanagari text.
>
I don't think so. I bet if we could survey web developers across the
industry (rather than just top-tier people who tend to participate in
discussions like this one), we would find that the vast majority of them never
bother handling non-BMP cases, and do not test non-BMP cases.
Heck, I don't even know if a non-BMP character can be data-entered into an
<input type="text" maxlength="1"> or not. (Do you? What happens?)
>> I suspect that 99.99% of users will find that
>> this change will fix bugs in their code when dealing with non-BMP
>> characters.
>>
>
> Not unless DOMString is changed or the interaction between the two very
> carefully defined in failure-proof ways.
>
Yes, I was dismayed to find out that DOMString is defined as UTF-16.
We could get away with converting to and from UTF-16 at the DOMString <>
JSString transition point. This might mean that JSString=>DOMString could
throw, as full Unicode Strings could contain code points which are not
representable in UTF-16.
If it doesn't throw on invalid-in-UTF-16 code points, then round-tripping is
lossy. If it does, that's silly.
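
A rough sketch of what the throwing variant of that transition could look like. Since no engine has a full-Unicode string type, I am modelling one as a plain array of code points; that representation and the function name codePointsToUtf16 are assumptions made purely for illustration:

```javascript
// Hypothetical model: a "full Unicode" string is an array of code points.
// Converts it to a UTF-16 JS string, throwing on anything UTF-16 cannot hold.
function codePointsToUtf16(codePoints) {
  var units = [];
  for (var i = 0; i < codePoints.length; i++) {
    var cp = codePoints[i];
    // Surrogate code points and out-of-range values have no UTF-16 form.
    if ((cp >= 0xd800 && cp <= 0xdfff) || cp < 0 || cp > 0x10ffff)
      throw new RangeError("not representable in UTF-16: " + cp.toString(16));
    if (cp < 0x10000) {
      units.push(cp); // BMP code point: one code unit
    } else {
      var offset = cp - 0x10000; // supplementary: split into a pair
      units.push(0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff));
    }
  }
  return String.fromCharCode.apply(null, units);
}

var converted = codePointsToUtf16([0x61, 0x1d49c]); // "a" + surrogate pair
```

The lossy alternative would substitute something like U+FFFD instead of throwing, which is exactly where round-tripping stops being faithful.
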
>
> It needed to specify _something_, and UTF-16 was the thing that was
> compatible with how scripts work in ES. Not to mention the Java legacy of
> the DOM...
>
By this comment, I am inferring then that DOM and JS Strings share their
backing store. From an API-cleanliness point of view, that's too bad. From
an implementation POV, it makes sense. Actually, it makes even more sense
when I recall the discussion we had last week when you explained how
external strings etc work in SpiderMonkey/Gecko.
Do all the browsers share JS/DOM String backing stores?
>> It is an unfortunate accident of history that UTF-16 surrogate pairs leak
>> their abstraction into ES Strings, and I believe it is high time we fixed
>> that.
>>
>
> If you can do that without breaking web pages, great. If not, then we need
> to talk. ;)
>
There is no question in my mind that this proposal would break Unicode-aware
JS. It is my belief that that doesn't matter if it accompanies other major,
opt-in changes.
Resolving DOM String <> JS String interchange is a little trickier, but I
think it can be managed if we can allow JS=>DOM to throw when high surrogate
code points are encountered in the JS String. It might mean extra copying,
or it might not if the DOM implementation already uses UTF-8 internally.
Wes
--
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss