RE: Full Unicode strings strawman

Shawn Steele Mon, 16 May 2011 14:51:59 -0700

I’d go further and also say there isn’t really a very big practical difference 
between:


·         A UCS-2 implementation whose data is rendered by a completely 
Unicode-aware rendering engine, and

·         A UTF-16 implementation.

In fact I’m unaware of any UCS-2/UTF-16 conversion functionality that causes 
D800-DFFF to throw an error or change to U+FFFD; most just blindly pass along 
the input, pretending they’re the same, or at least “close enough.”
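
To make the "blindly pass along" behaviour concrete, here is a minimal sketch in modern JavaScript (ES2018 or later is assumed, for regex lookbehind). The helper name replaceLoneSurrogates is hypothetical and not part of any spec or library; it shows what a strict converter could do, in contrast with plain string handling, which keeps an unpaired surrogate untouched.

```javascript
// Hypothetical helper (not a standard API): replace every unpaired
// surrogate code unit with U+FFFD, leaving well-formed pairs alone.
function replaceLoneSurrogates(s) {
  // A high surrogate not followed by a low one, or a low surrogate
  // not preceded by a high one. No "u" flag: we match code units.
  const lone =
    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g;
  return s.replace(lone, "\uFFFD");
}

// Ordinary string operations pass the lone surrogate straight through.
const passThrough = "a\uD800b";
const copied = passThrough.slice(0); // still contains the lone D800

// A strict conversion would substitute U+FFFD instead.
const strict = replaceLoneSurrogates(passThrough);

// A well-formed pair (U+10000) is left unchanged.
const pairKept = replaceLoneSurrogates("\uD800\uDC00");
```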

-Shawn

From: [email protected] [mailto:[email protected]] On 
Behalf Of Mark Davis ☕
Sent: Monday, May 16, 2011 2:45 PM
To: Shawn Steele
Cc: Jungshik Shin (신정식, 申政湜); Markus Scherer; [email protected]
Subject: Re: Full Unicode strings strawman

In terms of implementation capabilities, there isn't really a significant 
practical difference between

  *   a UCS-2 implementation, and
  *   a UTF-16 implementation that doesn't have supplemental characters in its 
supported repertoire.

Mark

— Il meglio è l’inimico del bene —

On Mon, May 16, 2011 at 14:28, Shawn Steele <[email protected]> wrote:
I think the problem isn’t so much that the spec used UCS-2, but rather that 
some implementations used UTF-16 instead as that is more convenient in many 
cases.  To the application developer, it’s difficult to tell the difference 
between UCS-2 and UTF-16 if I can use a regular expression to find D800, DC00.  
Indeed, when the rendering engine of whatever host is going to display the 
glyph for U+10000, it’d be hard to notice the subtlety of UCS-2 vs UTF-16.
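
As a rough illustration of the regular-expression point, the sketch below uses modern JavaScript. The "u" flag and the \u{...} escape are ES2015 features that did not exist when this thread was written. Without the flag the pattern sees individual code units; with it the pattern sees whole code points.

```javascript
// A string containing U+10000, stored as the code units D800 DC00.
const s = "x\uD800\uDC00y";

// Code-unit view: the two surrogates can be found individually.
const findsPairAsUnits = /\uD800\uDC00/.test(s);
const findsHighAlone = /\uD800/.test(s);

// Code-point view (the "u" flag): a lone-surrogate pattern does not
// match half of a well-formed pair, but the code point itself matches.
const findsHighAloneUnicode = /\uD800/u.test(s);
const findsCodePoint = /\u{10000}/u.test(s);
```

In the code-unit view an application cannot really tell whether the engine "is" UCS-2 or UTF-16, which is the observation made above.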

-Shawn

From: [email protected] [mailto:[email protected]] 
On Behalf Of Jungshik Shin (신정식, 申政湜)
Sent: Monday, May 16, 2011 2:24 PM
To: Mark Davis ☕
Cc: Markus Scherer; [email protected]

Subject: Re: Full Unicode strings strawman


On Mon, May 16, 2011 at 2:19 PM, Mark Davis ☕ <[email protected]> wrote:
I'm quite sympathetic to the goal, but the proposal does represent a 
significant breaking change. The problem, as Shawn points out, is with 
indexing. Before, the strings were defined as UTF16.

I agree with what Mark wrote, except that the previous spec used UCS-2, which 
this proposal (and other proposals on the issue) tries to rectify. I think that 
taking Java's approach would work better with DOMString as well.

(See the W3C I18N WG's 
proposal<http://www.w3.org/International/wiki/JavaScriptInternationalization>  
on the issue and Java's 
approach<http://java.sun.com/developer/technicalArticles/Intl/Supplementary/> 
linked there.)

Jungshik


Take a sample string "\ud800\udc00\u0061" = "\u{10000}\u{61}". Right now, the 
'a' (the \u{61}) is at offset 2. If the proposal were accepted, the 'a' would 
be at offset 1. This will definitely cause breakage in existing code; 
characters are in different positions than they were, even characters that are 
not supplemental ones. All it takes is one supplemental character before the 
current position and the offsets will be off for the rest of the string.
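
The offset shift described above can be checked with a short sketch. It assumes a modern JavaScript engine (ES2015 or later), where Array.from splits a string by code point; that stands in here for the proposed code-point indexing, which is an assumption about how the proposal would behave rather than a quotation of it.

```javascript
// The sample string from the message: U+10000 followed by 'a'.
const sample = "\ud800\udc00\u0061";

// Current behaviour: indexing counts 16-bit code units.
const codeUnitOffset = sample.indexOf("a");

// Code-point view: Array.from iterates by code point, so the
// supplementary character occupies a single slot.
const codePoints = Array.from(sample);
const codePointOffset = codePoints.indexOf("a");

// Both spellings denote the same string value.
const sameString = sample === "\u{10000}\u{61}";
```

Here codeUnitOffset is 2 and codePointOffset is 1, so any stored offset computed under one scheme is off by one under the other for everything after the supplementary character.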

Faced with exactly the same problem, Java took a different approach that allows 
for handling of the full range of Unicode characters, but maintains backwards 
compatibility. It may be instructive to look at what they did (although there 
was definitely room for improvement in their approach!). I can follow up with 
that if people are interested. Alternatively, perhaps mechanisms can be put in 
place to tell ECMAScript to use new vs old indexing (Perl uses PRAGMAs for that 
kind of thing, for example), although that has its own ugliness.
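
For readers who have not seen it, the Java approach keeps indexing in 16-bit code units and adds code-point-aware methods alongside (codePointAt, codePointCount, offsetByCodePoints on java.lang.String). A minimal JavaScript sketch of the same idea follows. The two helper functions are hypothetical names written for illustration, and String.prototype.codePointAt is an ES2015 addition that postdates this thread.

```javascript
// Hypothetical analogue of Java's offsetByCodePoints (forward only):
// returns the code-unit index reached after advancing
// `codePointOffset` code points from code-unit index `index`.
function offsetByCodePoints(s, index, codePointOffset) {
  let i = index;
  for (let n = 0; n < codePointOffset; n++) {
    if (i >= s.length) throw new RangeError("offset past end of string");
    // A supplementary code point occupies two code units.
    i += s.codePointAt(i) > 0xFFFF ? 2 : 1;
  }
  return i;
}

// Hypothetical analogue of Java's codePointCount over a whole string.
function codePointCount(s) {
  let count = 0;
  for (const _ of s) count++; // string iteration is by code point
  return count;
}

const t = "\u{10000}a";
const unitIndexOfSecond = offsetByCodePoints(t, 0, 1); // where 'a' starts
const second = t.charAt(unitIndexOfSecond);
```

Existing offsets stay valid because length and charAt are untouched; code that cares about supplementary characters opts in to the new accessors.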

Mark

— Il meglio è l’inimico del bene —
On Mon, May 16, 2011 at 13:38, Wes Garland <[email protected]> wrote:
Allen;

Thanks for putting this together.  We use Unicode data extensively in both our 
web and server-side applications, and being forced to deal with UTF-16 
surrogate pairs directly -- rather than letting the String implementation deal 
with them -- is a constant source of mild pain.  At first blush, this proposal 
looks like it meets all my needs, and my gut tells me the perf impacts will 
probably be neutral or good.

Two great things about strings composed of Unicode code points:
1) .length represents the number of code points, rather than the number of 
16-bit code units used in UTF-16, even if the underlying representation isn't 
UTF-16
2) S.charCodeAt(S.indexOf(X)) always returns the same kind of information (a 
Unicode code point), regardless of whether X is in the BMP or not
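
A small sketch of the two properties above, contrasting the current code-unit behaviour with a code-point view. It assumes a modern engine (ES2015 or later). The spread operator and codePointAt are used here only to approximate what the proposal would make the default; they are not the proposal's own API.

```javascript
// U+1D11E (MUSICAL SYMBOL G CLEF) is outside the BMP; 'A' is inside it.
const X = "\u{1D11E}";
const S = X + "A";

// 1) Length: code units today versus code points.
const lengthInCodeUnits = S.length;
const lengthInCodePoints = [...S].length;

// 2) What S.charCodeAt(S.indexOf(X)) yields today: only the high
//    surrogate, which is not the code point that was searched for.
const viaCharCodeAt = S.charCodeAt(S.indexOf(X));

//    A code-point accessor returns the same kind of value for BMP
//    and non-BMP characters alike.
const viaCodePointAt = S.codePointAt(S.indexOf(X));
const bmpViaCodePointAt = S.codePointAt(S.indexOf("A"));
```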

Even though this is a breaking change from ES-5, I support it whole-heartedly... 
but I expect breakage to be very limited. Provided that the implementation does 
not restrict the storage of reserved code points (D800-DFFF), it should be 
possible for users using String as immutable C-arrays to keep doing so. Users 
doing surrogate pair decomposition will probably find that their code "just 
works", as those code points will never appear in legitimate strings of Unicode 
code points.  Users creating Strings with surrogate pairs will need to re-tool, 
but this is a small burden and these users will be at the upper strata of 
Unicode-foodom.  I suspect that 99.99% of users will find that this change will 
fix bugs in their code when dealing with non-BMP characters.
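
For reference, the surrogate pair arithmetic that such user code typically performs looks roughly like the sketch below. The function names are hypothetical; the arithmetic is the standard UTF-16 mapping.

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into its
// UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  const v = codePoint - 0x10000;         // 20-bit value
  const high = 0xD800 + (v >> 10);       // top 10 bits
  const low = 0xDC00 + (v & 0x3FF);      // bottom 10 bits
  return [high, low];
}

// Recombine a high/low surrogate pair into a code point.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

const pair = toSurrogatePair(0x1F4A9);
const roundTrip = fromSurrogatePair(pair[0], pair[1]);
```

Under a code-point string model, code like fromSurrogatePair would simply never see values in the surrogate range coming out of a well-formed string, which is why it would tend to keep working unchanged.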

Mike Samuel, there would never be a supplemental code unit to match, as the 
return value of [[Get]] would be a code point.

Shawn Steele, I don't understand this comment:

Also, the “trick” I think, is encoding to surrogate pairs (illegally, since 
UTF8 doesn’t allow that) vs decoding to UTF16.

Why do we care about the UTF-16 representation of particular codepoints?  Why 
can't the new functions just encode the Unicode string as UTF-8 and URI escape 
it?
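
As a point of comparison, the existing encodeURIComponent already does this for well-formed input: it encodes the code point as UTF-8 and percent-escapes the bytes, and it rejects a lone surrogate. A minimal sketch follows (the \u{...} escape assumes ES2015 or later):

```javascript
// U+10000 is encoded as the four UTF-8 bytes F0 90 80 80, not as two
// separately encoded surrogates.
const escaped = encodeURIComponent("\u{10000}");
const decoded = decodeURIComponent(escaped);

// A lone surrogate has no UTF-8 encoding, so the function throws.
let loneSurrogateRejected = false;
try {
  encodeURIComponent("\uD800");
} catch (e) {
  loneSurrogateRejected = e instanceof URIError;
}
```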

Mike Samuel, can you explain why you are en/decoding UTF-16 when round-tripping 
through the DOM?  Does the DOM specify UTF-16 encoding? If it does, that's 
silly.  Both ES and DOM should specify "Unicode" and let the data interchange 
format be an implementation detail.  It is an unfortunate accident of history 
that UTF-16 surrogate pairs leak their abstraction into ES Strings, and I 
believe it is high time we fixed that.

Wes

--
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss

