Re: Question about the “full Unicode in strings” strawman

Allen Wirfs-Brock Wed, 25 Jan 2012 11:55:33 -0800

On Jan 25, 2012, at 10:59 AM, Gillam, Richard wrote:

>> The current 16-bit character strings are sometimes used to store non-Unicode 
>> binary data and can be used with non-Unicode character encodings with up to 
>> 16-bit chars.  21 bits is sufficient for Unicode but perhaps is not enough 
>> for other useful encodings. 32-bit seems like a plausible unit.
> 
> How would an eight-digit \u escape sequence work from an implementation 
> standpoint?  I'm assuming most implementations right now use 16-bit unsigned 
> values as the individual elements of a String.  If we allow arbitrary 32-bit 
> values to be placed into a String, how would you make that work?  There seem 
> to only be a few options:
> 
> a) Change the implementation to use 32-bit units.
> 
> b) Change the implementation to use either 16-bit or 32-bit units as needed, 
> with some sort of internal flag that specifies the unit size for an 
> individual string.
> 
> c) Encode the 32-bit values somehow as a sequence of 16-bit values.
> 
> If you want to allow full generality, it seems like you'd be stuck with 
> option a or option b.  Is there really enough value in doing this?


This issue is partly addressed in the proposal's implementation impacts 
section:
http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings#possible_implementation_impacts
 

My assumption is that most implementations would choose b, although the others 
would all be valid implementation approaches.  Note that some implementations 
already use multiple alternative internal string representations in order to 
optimize various scenarios. 
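
As a rough illustration of option b (this is not taken from the proposal; 
FlexString, unitSize, and units are made-up names, and real engines would do 
this in native code rather than in script), a string could pick its element 
width when it is created:

```javascript
// Hypothetical sketch of option (b): choose the element width per string.
// FlexString and its members are illustrative names only; they are not part
// of the strawman or of any engine.
class FlexString {
  constructor(elements) {
    // Stay with 16-bit storage unless some element needs more than 16 bits.
    const wide = elements.some((e) => e > 0xFFFF);
    this.unitSize = wide ? 32 : 16; // the "internal flag" for this string
    this.units = wide
      ? Uint32Array.from(elements)
      : Uint16Array.from(elements);
  }

  // Every element occupies one position, whatever the storage width.
  get length() {
    return this.units.length;
  }

  charCodeAt(index) {
    return this.units[index];
  }
}

const bmpOnly = new FlexString([0x48, 0x69]);        // fits in 16-bit units
const withAstral = new FlexString([0x48, 0x10FFFF]); // needs 32-bit units
```

In this sketch only strings that actually contain a value above 0xFFFF pay 
for the wider storage, which is the main attraction of b over a.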
> 
> If, on the other hand, the idea is just to make it easier to include non-BMP 
> Unicode characters in strings, you can accomplish this by making a long \u 
> sequence just be shorthand for the equivalent sequence in UTF-16:  \u10ffff 
> would be exactly equivalent to \udbff\udfff.  You don't have to change the 
> internal format of the string, the indexes of individual characters stay the 
> same, etc.

The primary intent of the proposal was to extend ES Strings to support a 
uniform representation of all Unicode characters, including non-BMP.  That 
means that any Unicode character should occupy exactly one element position 
within a String value.  Interpreting \u{10ffff} as a UTF-16 encoding does not 
satisfy that objective.  In particular, under that approach 
"\u{10ffff}".length would be 2, while a uniform character representation 
should yield a length of 1.
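
To make the length difference concrete, here is a small sketch using only 
what current 16-bit strings already provide. The helper name toSurrogatePair 
is made up for this example, and the plain array standing in for a 
"uniform" string is just a model, not a proposed API:

```javascript
// Split a supplementary code point (0x10000..0x10FFFF) into the two 16-bit
// units that UTF-16 uses for it. toSurrogatePair is an illustrative name.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;        // 20-bit value
  var high = 0xD800 + (offset >> 10);      // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);     // bottom 10 bits
  return [high, low];
}

var pair = toSurrogatePair(0x10FFFF);      // 0xDBFF, 0xDFFF

// UTF-16 interpretation: the character occupies two element positions.
var utf16 = String.fromCharCode(pair[0], pair[1]);

// Uniform interpretation, modeled as one array slot per character.
var uniform = [0x10FFFF];
```

Here utf16.length is 2 and uniform.length is 1, which is exactly the 
distinction the proposal cares about.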

When this proposal was originally floated, much of the debate seemed to be 
about whether such a uniform character representation was desirable or even 
useful.  See the thread starting at 
https://mail.mozilla.org/pipermail/es-discuss/2011-May/014252.html also 
https://mail.mozilla.org/pipermail/es-discuss/2011-May/014316.html and  

Allen

