On 27 Feb 2012, at 22:58, Allen Wirfs-Brock <al...@wirfs-brock.com> wrote:

> On Feb 26, 2012, at 1:55 AM, Mathias Bynens wrote:
> 
>> For example, U+2F800 CJK COMPATIBILITY IDEOGRAPH-2F800 is a supplementary 
>> Unicode character in the [Lo] category, which leads me to believe it should 
>> be allowed in identifier names. After all, the spec says:
>> 
>> UnicodeLetter = any character in the Unicode categories “Uppercase letter 
>> (Lu)”, “Lowercase letter (Ll)”, “Titlecase letter (Lt)”, “Modifier letter 
>> (Lm)”, “Other letter (Lo)”, or “Letter number (Nl)”.
>> 
>> However, since JavaScript uses UCS-2 internally, this symbol is represented 
>> by a surrogate pair, i.e. two code units: `\uD87E\uDC00`.
>> 
>> The spec, however, defines “character” as follows: 
>> http://es5.github.com/x6.html#x6
>> 
>> Throughout the rest of this document, the phrase “code unit” and the word 
>> “character” will be used to refer to a 16-bit unsigned value used to 
>> represent a single 16-bit unit of text. The phrase “Unicode character” will 
>> be used to refer to the abstract linguistic or typographical unit 
>> represented by a single Unicode scalar value (which may be longer than 16 
>> bits and thus may be represented by more than one code unit). The phrase 
>> “code point” refers to such a Unicode scalar value. “Unicode character” only 
>> refers to entities represented by single Unicode scalar values: the 
>> components of a combining character sequence are still individual “Unicode 
>> characters,” even though a user might think of the whole sequence as a 
>> single character.
>> 
>> So, based on this definition of “character” (code unit), U+2F800 should not 
>> be allowed in an identifier name after all.
>> 
>> I’m not sure if my interpretation of the spec is correct, though. Could 
>> anyone confirm or deny this? Are supplementary (non-BMP) Unicode characters 
>> allowed in identifiers or not? For example, is this valid JavaScript or not?
> 
> Yes, this interpretation is consistent with my understanding of the 
> requirements as expressed in the ES5 spec.   ES5 logically only works with 
> UCS-2 characters corresponding to the BMP.
> 
> Some (probably most) implementations pass UTF-16 encodings of supplemental 
> characters to the JavaScript compiler.  According to the spec, these are 
> processed as two UCS-2 characters neither of which would be a member of any 
> of the above character categories.  Their use in an identifier context should 
> result in a syntax error.  Within a string literal, the two UCS-2 characters 
> would generate two string elements.
> 
> This is something that I think can be clarified for the ES6 specification, 
> independent of the on-going discussion of the possibility of 21-bit string 
> elements.  My preference for the future is to simply define the input 
> alphabet of ECMAScript as all Unicode characters independent of actual 
> encoding.

That sounds nice.

> var \ud87e\udc00 would probably still be illegal because each \uXXXX defines a 
> separate character, but: var \u{2f800} = 42; should be fine, as should the 
> direct non-escaped occurrence of that character.

Wouldn’t this be confusing, though?

    // would work (compatible with ES5 behavior)
    global['\u{2F800}'] = 42;
    // would work, too, since `'\uD87E\uDC00' == '\u{2F800}'`
    // (compatible with ES5 behavior)
    global['\uD87E\uDC00'] = 42;
    // would fail (compatible with ES5 behavior)
    var \uD87E\uDC00 = 42;
    // would work (as per your comment; incompatible with ES5 behavior)
    var \u{2F800} = 42;
    // would work (as per your comment; incompatible with ES5 behavior)
    var 丽 = 42;

Using astral symbols in identifiers would be backwards incompatible, even if 
the raw (unescaped) symbol is used. There’d be no way to use such an identifier 
in an ES5 environment. Is this a problem?
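
For reference, here is a minimal sketch of the surrogate-pair arithmetic behind 
the examples above, written with ES5-only features. The helper names 
`toSurrogatePair` and `parses` are hypothetical and only there for 
illustration; they are not part of any spec or proposal.

```javascript
// Hypothetical helper: split a supplementary code point (above U+FFFF)
// into the two UTF-16 code units that represent it.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;
  var high = 0xD800 + (offset >> 10);   // top 10 bits of the offset
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits of the offset
  return [high, low];
}

// Hypothetical helper: report whether a piece of source text parses
// as a function body, without running it.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (e) {
    return false;
  }
}

var pair = toSurrogatePair(0x2F800);
var str = String.fromCharCode(pair[0], pair[1]);

// U+2F800 occupies two string elements, matching '\uD87E\uDC00'.
var pairHex = pair[0].toString(16) + ' ' + pair[1].toString(16);
var lengthIsTwo = str.length === 2;
var sameAsEscapes = str === '\uD87E\uDC00';

// Two separate \uXXXX escapes do not form a valid identifier.
var escapedPairParses = parses('var \\uD87E\\uDC00 = 42;');
```

Here `pairHex` comes out as `d87e dc00`, `lengthIsTwo` and `sameAsEscapes` are 
both true, and `escapedPairParses` is false. Whether 
`parses('var \\u{2F800} = 42;')` succeeds depends on whether the engine 
implements the proposed `\u{...}` syntax, so the sketch leaves that case out.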
_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
