Allen--

> I tried to post a pointer to this strawman on this list a few weeks ago, but
> apparently it didn't reach the list for some reason.
>
> Feedback would be appreciated:
>
> http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings

I was actually on the committee when the language you're proposing to change 
was adopted and, in fact, I think I actually proposed that wording.

The intent behind the original wording, back in ES3, was to extend the standard 
to allow the use of the full range of Unicode characters, and to do it in more 
or less the same way Java had done it: while the actual choice of an internal 
string representation would be left up to the implementer, all public 
interfaces (where it made a difference) would behave exactly as if the internal 
representation were UTF-16.  In particular, you would represent 
supplementary-plane characters with two \u escape sequences representing a 
surrogate pair, and interfaces that assigned numeric indexes to characters in 
strings would do so based on the UTF-16 representation of the string-- a 
supplementary-plane character would take up two character positions in the 
string.
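
To make that numbering concrete (using U+1D11E, MUSICAL SYMBOL G CLEF, as an 
example; this is just today's behavior, not anything new):

    var clef = "\uD834\uDD1E";   // one supplementary-plane character
    clef.length;                 // 2: two UTF-16 code units
    clef.charCodeAt(0);          // 0xD834 (high surrogate)
    clef.charCodeAt(1);          // 0xDD1E (low surrogate)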

I don't have a problem with introducing a new escaping syntax for 
supplementary-plane characters, but I really don't think you want to go messing 
with string indexing.  It'd be a breaking change for existing implementations.  
I don't think it actually matters whether the existing implementation 
"supported" UTF-16 or not-- if you read string content that included surrogate 
pairs from 
some external source, I doubt anything in the JavaScript implementation was 
filtering out the surrogate pairs because the implementation "only supported 
UCS-2".  And most things would have worked fine.  But the characters would be 
numbered according to their UTF-16 representation.
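
As a contrived illustration of that compatibility hazard (the string and the 
stored offset are made up; the commented results reflect current behavior):

    var s = "\uD834\uDD1Eabc";          // one supplementary character, then "abc"
    var savedOffset = s.indexOf("a");   // 2 under today's UTF-16-based indexing

    // An offset like this that was computed earlier, persisted, or received
    // from another system stops pointing at the right place if indexing is
    // redefined in terms of code points ("a" would then be at index 1):
    s.substring(savedOffset);           // "abc" today; "bc" with code-point indexing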

If you want to introduce new APIs that index things according to the UTF-32 
representation, that'd be okay, but it's more of a burden for implementations 
that use UTF-16 for their internal representation, and we optimized for that on 
the assumption that it was the most common choice.
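
For what it's worth, here's a rough sketch of what a code-point-indexed 
accessor layered on a UTF-16 representation would end up doing (the name and 
signature are just illustrative, not something from the strawman).  Note that 
it's a linear scan per call rather than a constant-time index, which is the 
kind of burden I have in mind:

    // Return the code point at the given code-point index, scanning the
    // UTF-16 representation from the beginning of the string.
    function codePointAtIndex(s, cpIndex) {
      var i = 0;                          // position in UTF-16 code units
      for (var seen = 0; i < s.length; seen++) {
        var first = s.charCodeAt(i);
        var second = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
        var isPair = first >= 0xD800 && first <= 0xDBFF &&
                     second >= 0xDC00 && second <= 0xDFFF;
        if (seen === cpIndex) {
          return isPair
            ? ((first - 0xD800) << 10) + (second - 0xDC00) + 0x10000
            : first;
        }
        i += isPair ? 2 : 1;
      }
      return undefined;                   // index is past the end of the string
    }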

Defining String.fromCharCode() to build a string based on an abstract Unicode 
code point value might be okay (although it might be better to make that a new 
function), but when presented with a code point value above 0xFFFF, it'd 
produce a string of length 2-- the length of the UTF-16 representation.  
String.charCodeAt() was always defined, and should continue to be defined, 
based on the UTF-16 representation.  If you want to introduce a new API based 
on the UTF-32 representation, fine.
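
To sketch the encoding step I'm describing (the standalone function name here 
is purely illustrative, not a proposed API):

    // Build a string from an abstract code point value; anything above
    // 0xFFFF becomes a surrogate pair, i.e. a string of length 2.
    function stringFromCodePoint(cp) {
      if (cp <= 0xFFFF) {
        return String.fromCharCode(cp);
      }
      var offset = cp - 0x10000;               // 20 significant bits remain
      var high = 0xD800 + (offset >> 10);      // top 10 bits
      var low  = 0xDC00 + (offset & 0x3FF);    // bottom 10 bits
      return String.fromCharCode(high, low);
    }

    stringFromCodePoint(0x1D11E).length;              // 2
    stringFromCodePoint(0x1D11E) === "\uD834\uDD1E";  // true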

I'd also recommend against flogging the 21-bit thing so heavily-- the 21-bit 
thing is sort of an accident of history, and not all 21-bit values are legal 
Unicode code point values either.  I'd use "32" for the longer forms of things.

I think it's fine to have everything work in terms of abstract Unicode code 
points, but I don't think you can ignore the backward-compatibility issues with 
character indexing in the current API.

--Rich Gillam
  Lab126


