Re: Full Unicode strings strawman

Allen Wirfs-Brock Mon, 16 May 2011 13:17:35 -0700

On May 16, 2011, at 11:34 AM, Shawn Steele wrote:

> Thanks for making a strawman
(see my very last sentence below as it may impact the interpretation of some of 
the rest of these responses)



>  
> Unicode Escape Sequences
> Is it possible for U+ to accept either 4, 5, or 6 digit sequences?   
> Typically when I encounter U+ notation the leading zero is omitted, and I see 
> BMP characters quite often.  Obviously BMP could use the U notation, however 
> it seems like it’d be annoying to the occasional user to know that U is used 
> for some and U+ for others.  Seems like it’d be easier for developers to 
> remember that U+ is “the new way” and U is “the old way that doesn’t always 
> work”.

The ES string literal notation doesn't really accommodate variable-length 
subtokens without explicit terminators.  What would be the rules for parsing 
"\u+12345678"?  How do we know if the programmer meant "\u1234"+"5678" or 
"\u0012"+"345678" or ...

There have been past proposals for a syntax like \u{xxxxxx} that could have 1 to 
6 hex digits.  In the past proposal the assumption was that it would produce 
UTF-16 surrogate pairs, but in this context we could adopt it instead of \u+ to 
produce a single character.  The disadvantage is that it is a slightly longer 
sequence for actual large code points.  On the other hand, perhaps it is more 
readable?  "\u+123456\u+123456" vs. "\u{123456}\u{123456}" ??
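
A minimal sketch of the difference between the two notations, written against 
raw source text rather than real literals.  The helper names possibleReadings 
and scanBraceEscapes are hypothetical, and the choice of 4, 5, or 6 digits for 
the unterminated form is an assumption taken from the question above; neither 
helper is part of the strawman.

```javascript
// Hypothetical helper: every way an unterminated "\u+" escape could be read
// if it accepted 4, 5, or 6 hex digits. The digits are passed as a plain
// string because "\u+" is strawman notation, not an existing ES escape.
function possibleReadings(hexDigits) {
  const readings = [];
  for (const digitCount of [4, 5, 6]) {
    if (hexDigits.length >= digitCount) {
      readings.push({
        value: parseInt(hexDigits.slice(0, digitCount), 16),
        rest: hexDigits.slice(digitCount),
      });
    }
  }
  return readings;
}

// Hypothetical scanner for the brace form: the closing brace terminates the
// escape, so each escape has exactly one reading. Returns numeric values only.
function scanBraceEscapes(rawSource) {
  const pattern = /\\u\{([0-9a-fA-F]{1,6})\}/g;
  const values = [];
  let match;
  while ((match = pattern.exec(rawSource)) !== null) {
    values.push(parseInt(match[1], 16));
  }
  return values;
}

// Three candidate readings for the unterminated form.
const ambiguous = possibleReadings("12345678");
// One escape followed by four ordinary digit characters.
const unambiguous = scanBraceEscapes("\\u{1234}5678");
console.log(ambiguous.length, unambiguous.length);
```

The scanner returns numbers rather than strings on purpose, so the sketch does 
not take a position on whether a value above 0xFFFF ends up as one string 
element or as a surrogate pair.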


>  
> String Position
> It’s unclear to me if the string indices can be “changed” from UTF-16 to 
> UTF-32 positions.  Although UTF-32 indices are clearly desirable, I think 
> that many implementations currently allow UTF-16 codepoints U+D800 through 
> U+DFFF.  In other words, I can already have Javascript strings with full 
> Unicode range data in them.  Existing applications would then have indices 
> that pointed to the UTF-16, not UTF-32 index.  Changing the definition of the 
> index to UTF-32 would break those applications I think.

No, it wouldn't break anything, at least when applied to existing data.  Your 
existing code is explicitly doing UTF-16 processing.  Somebody had to do the 
processing to create the surrogate pairs in the string.  As long as you use that 
same agent, you are still going to get UTF-16 encoded strings.  Even though the 
underlying character values could hold single characters with codepoints > 
\uffff, the actual string won't unless somebody actually constructed the 
string to contain such values.  That presumably doesn't happen for existing 
code.
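
As an illustration of what "explicitly doing UTF-16 processing" looks like, 
here is a rough sketch of an agent that builds and reads surrogate pairs by 
hand.  The function names toSurrogatePair and fromSurrogatePair are 
hypothetical.  Code of this shape only ever stores 16-bit values in the string, 
so its indices keep the meaning they have today.

```javascript
// Hypothetical encoder: split a supplementary code point (above 0xFFFF)
// into a high and a low surrogate and store both as 16-bit elements.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);
  const low = 0xDC00 + (offset & 0x3FF);
  return String.fromCharCode(high, low);
}

// Hypothetical decoder: recombine the pair that starts at the given index.
// Assumes a well-formed pair is present at that position.
function fromSurrogatePair(str, index) {
  const high = str.charCodeAt(index);
  const low = str.charCodeAt(index + 1);
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

const encoded = toSurrogatePair(0x1F600);
console.log(encoded.length);                               // 2
console.log(encoded.charCodeAt(0).toString(16));           // d83d
console.log(encoded.charCodeAt(1).toString(16));           // de00
console.log(fromSurrogatePair(encoded, 0).toString(16));   // 1f600
```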

The place where existing code might break is if somebody explicitly constructs 
a string (using \u+ literals or String.fromCodepoint) that contains non-BMP 
characters and passes it to routines that only expect 16-bit characters.  
For this reason, any existing host routines that convert external data 
resources to ES strings that contain surrogate pairs should probably continue 
to do so.  New routines should be provided that produce single characters 
instead of pairs for non-BMP code points.  However, the definition of such 
routines is outside the scope of the ES specification.

Finally, note that just as current strings can contain 16-bit character values 
that are not valid Unicode code points, the expanded full Unicode strings can 
also contain 21-bit character values that are not valid Unicode code points. 
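
A small sketch of that point.  It takes one possible reading of "valid", namely 
a Unicode scalar value (0 through 0x10FFFF, excluding the surrogate range); that 
reading and the helper name isScalarValue are assumptions made for the example.

```javascript
// Current strings accept an unpaired surrogate as an ordinary 16-bit element.
const lone = "\uD800";
console.log(lone.length);                      // 1
console.log(lone.charCodeAt(0).toString(16));  // d800

// Hypothetical check: is the value a Unicode scalar value?
function isScalarValue(value) {
  return Number.isInteger(value) &&
    value >= 0 && value <= 0x10FFFF &&
    !(value >= 0xD800 && value <= 0xDFFF);
}

console.log(isScalarValue(lone.charCodeAt(0))); // false: unpaired surrogate
console.log(isScalarValue(0x1FFFFF));           // false: fits in 21 bits, above 0x10FFFF
console.log(isScalarValue(0x1F600));            // true
```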

>  
> You also touch on that with charCodeAt/codepointAt, which resolves the 
> problem with the output type, but doesn’t address the problem with the 
> indexing.  Similar to the way you differentiated charCode/codepoint, it may 
> be necessary to differentiate charCode/codepoint indices.  IMO .fromCharCode 
> doesn’t have this problem since it used to fail, but now works, which 
> wouldn’t be breaking.  Unless we’re concerned that now it can return a 
> different UTF-16 length than before.

Again, nothing changes.  Code that expects to deal with multi-character 
encodings can still do so.  What "magically" changes is that code that acts as 
if Unicode codepoints are only 16 bits (i.e., the code doesn't correctly deal 
with surrogate pairs) will now work with full 21-bit characters.
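
To make that concrete, here is a sketch of the kind of code in question: a 
reversal routine that treats every string element as one character.  The name 
naiveReverse is hypothetical.  With today's 16-bit elements it splits a 
surrogate pair.  The second result uses Array.from, which iterates by code 
point, only to approximate what the same naive loop would yield if the non-BMP 
character occupied a single string element; it is not the strawman's mechanism.

```javascript
// Hypothetical routine that assumes one string element == one character.
function naiveReverse(str) {
  let out = "";
  for (let i = str.length - 1; i >= 0; i--) {
    out += str.charAt(i);
  }
  return out;
}

// U+1F600 stored as a surrogate pair between two BMP characters.
const text = "a\uD83D\uDE00b";

// Element-wise reversal swaps the two surrogates, leaving an ill-formed pair.
const broken = naiveReverse(text);
console.log(broken === "b\uDE00\uD83Da");   // true

// Reversing by code point keeps the character intact.
const intact = Array.from(text).reverse().join("");
console.log(intact === "b\uD83D\uDE00a");   // true
```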

>  
> I don’t like the “21” in the name of decodeURI21.

Suggestions for better names are always welcome.


>   Also, the “trick” I think, is encoding to surrogate pairs (illegally, since 
> UTF8 doesn’t allow that) vs decoding to UTF16.  It seems like decoding can 
> safely detect input supplementary characters and properly decode them, or is 
> there something about encoding that doesn’t make that state detectable?

I think I'm missing the distinction you are making between surrogate pairs and 
UTF-16.  I think I've been using the terms interchangeably.  I may be munging 
up the terminology.




>  
> -Shawn
>  
> From: [email protected] [mailto:[email protected]] 
> On Behalf Of Allen Wirfs-Brock
> Sent: Monday, May 16, 2011 11:12 AM
> To: [email protected]
> Subject: Full Unicode strings strawman
>  
> I tried to post a pointer to this strawman on this list a few weeks ago, but 
> apparently it didn't reach the list for some reason.
>  
> Feed back would be appreciated:
>  
> http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings
>  
>  
> Allen

_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
