In terms of implementation capabilities, there isn't really a significant practical difference between

- a UCS-2 implementation, and
- a UTF-16 implementation that doesn't have supplemental characters in its supported repertoire.

Mark

*— Il meglio è l'inimico del bene —*

On Mon, May 16, 2011 at 14:28, Shawn Steele <[email protected]> wrote:

> I think the problem isn't so much that the spec used UCS-2, but rather that some implementations used UTF-16 instead, as that is more convenient in many cases. To the application developer, it's difficult to tell the difference between UCS-2 and UTF-16 if I can use a regular expression to find D800, DC00. Indeed, when the rendering engine of whatever host is going to display the glyph for U+10000, it'd be hard to notice the subtlety of UCS-2 vs. UTF-16.
>
> -Shawn
>
> *From:* [email protected] [mailto:[email protected]] *On Behalf Of* Jungshik Shin (???, ???)
> *Sent:* Monday, May 16, 2011 2:24 PM
> *To:* Mark Davis ☕
> *Cc:* Markus Scherer; [email protected]
> *Subject:* Re: Full Unicode strings strawman
>
> On Mon, May 16, 2011 at 2:19 PM, Mark Davis ☕ <[email protected]> wrote:
>
> I'm quite sympathetic to the goal, but the proposal does represent a significant breaking change. The problem, as Shawn points out, is with indexing. Before, the strings were defined as UTF-16.
>
> I agree with what Mark wrote, except that the previous spec used UCS-2, which this proposal (and other proposals on the issue) try to rectify. I think that taking Java's approach would work better with DOMString as well.
>
> (See the W3C I18N WG's proposal <http://www.w3.org/International/wiki/JavaScriptInternationalization> on the issue and Java's approach <http://java.sun.com/developer/technicalArticles/Intl/Supplementary/> linked there.)
>
> Jungshik
>
> Take a sample string "\ud800\udc00\u0061" = "\u{10000}\u{61}". Right now, the 'a' (the \u{61}) is at offset 2. If the proposal were accepted, the 'a' would be at offset 1.
This will definitely cause breakage in existing code; characters are in different positions than they were, even characters that are not supplemental ones. All it takes is one supplemental character before the current position and the offsets will be off for the rest of the string.

> Faced with exactly the same problem, Java took a different approach that allows for handling of the full range of Unicode characters, but maintains backwards compatibility. It may be instructive to look at what they did (although there was definitely room for improvement in their approach!). I can follow up with that if people are interested. Alternatively, perhaps mechanisms can be put in place to tell ECMAScript to use new vs. old indexing (Perl uses pragmas for that kind of thing, for example), although that has its own ugliness.
>
> Mark
>
> *— Il meglio è l'inimico del bene —*
>
> On Mon, May 16, 2011 at 13:38, Wes Garland <[email protected]> wrote:
>
> Allen;
>
> Thanks for putting this together. We use Unicode data extensively in both our web and server-side applications, and being forced to deal with UTF-16 surrogate pairs directly -- rather than letting the String implementation deal with them -- is a constant source of mild pain. At first blush, this proposal looks like it meets all my needs, and my gut tells me the perf impacts will probably be neutral or good.
>
> Two great things about strings composed of Unicode code points:
>
> 1) .length represents the number of code points, rather than the number of pairs used in UTF-16, even if the underlying representation isn't UTF-16
> 2) S.charCodeAt(S.indexOf(X)) always returns the same kind of information (a Unicode code point), regardless of whether X is in the BMP or not
>
> Even though this is a breaking change from ES5, I support it whole-heartedly... but I expect breakage to be very limited.
Provided that the implementation does not restrict the storage of reserved code points (D800-DFFF), it should be possible for users using String as immutable C-arrays to keep doing so. Users doing surrogate pair decomposition will probably find that their code "just works", as those code points will never appear in legitimate strings of Unicode code points. Users creating Strings with surrogate pairs will need to re-tool, but this is a small burden, and these users will be at the upper strata of Unicode-foodom. I suspect that 99.99% of users will find that this change will fix bugs in their code when dealing with non-BMP characters.

> Mike Samuel, there would never be a supplementary code unit to match, as the return value of [[Get]] would be a code point.
>
> Shawn Steele, I don't understand this comment:
>
>> Also, the "trick" I think, is encoding to surrogate pairs (illegally, since UTF-8 doesn't allow that) vs. decoding to UTF-16.
>
> Why do we care about the UTF-16 representation of particular code points? Why can't the new functions just encode the Unicode string as UTF-8 and URI-escape it?
>
> Mike Samuel, can you explain why you are en/decoding UTF-16 when round-tripping through the DOM? Does the DOM specify UTF-16 encoding? If it does, that's silly. Both ES and DOM should specify "Unicode" and let the data interchange format be an implementation detail. It is an unfortunate accident of history that UTF-16 surrogate pairs leak their abstraction into ES Strings, and I believe it is high time we fixed that.
>
> Wes
>
> --
> Wesley W. Garland
> Director, Product Development
> PageMail, Inc.
> +1 613 542 2787 x 102
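Wes's UTF-8 question can be answered empirically: `encodeURIComponent` already encodes a well-formed surrogate pair as the single code point's four-byte UTF-8 sequence, and rejects a lone surrogate — exactly the "UTF-8 doesn't allow surrogates" rule Shawn mentions. A quick check in any current engine:

```javascript
// A well-formed pair is treated as one code point: U+10000 encodes as
// the 4-byte UTF-8 sequence F0 90 80 80.
console.log(encodeURIComponent("\uD800\uDC00")); // "%F0%90%80%80"

// A lone surrogate is not a Unicode scalar value, so it has no UTF-8
// encoding at all; encodeURIComponent throws.
try {
  encodeURIComponent("\uD800");
} catch (e) {
  console.log(e instanceof URIError); // true
}
```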
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss

