> The current 16-bit character strings are sometimes used to store non-Unicode 
> binary data and can be used with non-Unicode character encodings with up to 
> 16-bit chars.  21 bits is sufficient for Unicode but perhaps is not enough 
> for other useful encodings.  32 bits seems like a plausible unit.

How would an eight-digit \u escape sequence work from an implementation 
standpoint?  I'm assuming most implementations right now use 16-bit unsigned 
values as the individual elements of a String.  If we allow arbitrary 32-bit 
values to be placed into a String, how would you make that work?  There seem to 
only be a few options:

a) Change the implementation to use 32-bit units.

b) Change the implementation to use 16-bit or 32-bit units as needed, with some 
sort of internal flag that specifies the unit size for an individual string.

c) Encode the 32-bit values somehow as a sequence of 16-bit values.

If you want to allow full generality, it seems like you'd be stuck with option 
a or option b.  Is there really enough value in doing this?

If, on the other hand, the idea is just to make it easier to include non-BMP 
Unicode characters in strings, you can accomplish this by making a long \u 
sequence just be shorthand for the equivalent sequence in UTF-16:  \u10ffff 
would be exactly equivalent to \udbff\udfff.  You don't have to change the 
internal format of the string, the indexes of individual characters stay the 
same, etc.
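To illustrate the arithmetic, here's a sketch of how a parser might expand a supplementary code point into its UTF-16 surrogate pair (the function name is hypothetical, just for illustration):

```javascript
// Expand a supplementary code point (0x10000-0x10FFFF) into its
// UTF-16 surrogate pair, as a long \u escape could do at parse time.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("expected a supplementary code point");
  }
  var v = codePoint - 0x10000;        // 20-bit offset into the supplementary planes
  var high = 0xD800 + (v >> 10);      // high (lead) surrogate: top 10 bits
  var low  = 0xDC00 + (v & 0x3FF);    // low (trail) surrogate: bottom 10 bits
  return [high, low];
}

// toSurrogatePair(0x10FFFF) yields [0xDBFF, 0xDFFF], so a hypothetical
// \u10ffff escape would denote the same string as "\udbff\udfff".
```

The string itself remains a sequence of 16-bit units; only the escape-sequence desugaring changes.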

--Rich Gillam
 Lab126

_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
