Re: Question about the “full Unicode in strings” strawman

Norbert Lindenberg Tue, 24 Jan 2012 23:45:59 -0800

I don't see the standard allowing character encodings other than UTF-16 in 
strings. Section 8.4 says "When a String contains actual textual data, each 
element is considered to be a single UTF-16 code unit." This aligns with other 
normative references to UTF-16 in sections 2, 6, and 15.1.3. Section 8.4 does 
seem to allow the use of strings for non-textual data, but character encodings 
are by definition for characters, i.e., textual data.
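
To make the section 8.4 wording concrete, here is a minimal sketch of what "each element is a single UTF-16 code unit" means for a character outside the Basic Multilingual Plane. The sample character (U+1F600) and the helper name decodeSurrogatePair are assumptions chosen for illustration; they come from neither the standard nor the strawman.

```javascript
// Sketch only: U+1F600 is an arbitrary supplementary character picked
// for illustration. It is written here as its two UTF-16 code units.
const sample = "\uD83D\uDE00";

// Hypothetical helper: combines a high and a low surrogate into the
// code point they encode, following the UTF-16 encoding form.
function decodeSurrogatePair(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

// The String has two elements, one per code unit, not one per character.
console.log(sample.length);                      // 2
console.log(sample.charCodeAt(0).toString(16));  // "d83d"
console.log(sample.charCodeAt(1).toString(16));  // "de00"

const codePoint = decodeSurrogatePair(sample.charCodeAt(0), sample.charCodeAt(1));
console.log(codePoint.toString(16));             // "1f600"
```

The sketch uses only length and charCodeAt, which operate on individual String elements, so the output reflects code units rather than characters.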


Using a Unicode escape for non-textual data seems like abuse to me - Unicode is 
a character encoding standard. For Unicode, anything beyond six hex digits is 
excessive.
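
The digit counts are easy to check. The following is a small sketch, assuming only that the largest Unicode code point is U+10FFFF; the helper names hexDigits and bitWidth are made up for this example.

```javascript
// Sketch only: helper names are hypothetical, introduced for illustration.
const MAX_CODE_POINT = 0x10FFFF;   // largest Unicode code point
const MAX_UINT32 = 0xFFFFFFFF;     // largest unsigned 32-bit integer

// Number of hex digits needed to write n without leading zeros.
const hexDigits = (n) => n.toString(16).length;

// Number of significant bits in n.
const bitWidth = (n) => n.toString(2).length;

console.log(hexDigits(MAX_CODE_POINT)); // 6
console.log(bitWidth(MAX_CODE_POINT));  // 21
console.log(hexDigits(MAX_UINT32));     // 8
console.log(bitWidth(MAX_UINT32));      // 32
```

So six hex digits cover every Unicode code point, while eight are needed only if String elements may hold arbitrary 32-bit values.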

Norbert


On Jan 24, 2012, at 17:14 , Allen Wirfs-Brock wrote:

> 
> On Jan 24, 2012, at 2:11 PM, Mark S. Miller wrote:
> 
>> On Tue, Jan 24, 2012 at 12:33 PM, Allen Wirfs-Brock <[email protected]> 
>> wrote:
>> Note that this proposal isn't currently under consideration for inclusion in 
>> ES.next, but the answer to your question is below.
>> [...] 
>> Just as the current definition of string specifies that a String is a 
>> sequence of 16-bit unsigned integer values, the proposal would specify that 
>> a String is a sequence of 32-bit unsigned integer values.  In neither case 
>> is it required that the individual String elements be valid Unicode 
>> code points or code units. 8 hex digits are required to express the full 
>> range of unsigned 32-bit integers.
>> 
>> Why 32? Unicode has only 21 bits of significance. Since we don't expect 
>> strings to be stored naively (taking up 4x the space that would otherwise be 
>> allocated),
> I believe most current implementations actually store 16 bits per character, 
> so it would be 2x rather than 4x.
>>  
> 
>> I don't see the payoff from choosing the next power of 2. The other choices 
>> I see are a) 21 bits, b) 53 bits, or c) unbounded.
> 
> The current 16-bit character strings are sometimes used to store non-Unicode 
> binary data and can be used with non-Unicode character encodings with up to 
> 16-bit chars.  21 bits is sufficient for Unicode but perhaps is not enough 
> for other useful encodings.  32 bits seems like a plausible unit.
> 
> The real controversy that developed over this proposal regarded whether or 
> not every individual Unicode character needs to be uniformly representable 
> as a single element of a String. This proposal took the position that it 
> should.  Other voices felt that such uniformity was unnecessary and seemed 
> content to expose UTF-8 or UTF-16.  The argument was that applications may 
> have to look at multiple character logical units anyway, so dealing with UTF 
> encodings isn't much of an added burden. 
> 
> Allen
> _______________________________________________
> es-discuss mailing list
> [email protected]
> https://mail.mozilla.org/listinfo/es-discuss

