JavaScript source today is a sequence of UTF-16 code units, because that's what 
clause 6 of ES5 says and what most implementations do (V8/Node currently limits 
source to UCS-2, but a fix for that is on the way). Clause 6 states: "If an 
actual source text is encoded in a form other than 16-bit code units it must 
be processed as if it was first converted to UTF-16."

Actual source code is normally encoded in UTF-8 or some legacy encoding, so it 
must be converted to UTF-16. The rest of the ES5 spec deals with source text in 
terms of code units, not in terms of code points.
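
As a rough illustration of the code unit view, here is a minimal sketch using 
the tetragram U+1D306 from this thread. The helper codePointAtIndex is 
hypothetical (it is not part of ES5); it only shows how a surrogate pair maps 
back to one code point. The regexp lines assume ES5 matching semantics, where 
"." consumes a single code unit.

```javascript
// U+1D306 TETRAGRAM FOR CENTRE, written as a UTF-16 surrogate pair.
var tetragram = "\uD834\uDF06";

// length counts UTF-16 code units, not code points.
var unitCount = tetragram.length;

// Hypothetical helper (not an ES5 built-in): decode the code point that
// starts at the given code unit index, combining a valid surrogate pair.
function codePointAtIndex(str, index) {
  var hi = str.charCodeAt(index);
  if (hi >= 0xD800 && hi <= 0xDBFF && index + 1 < str.length) {
    var lo = str.charCodeAt(index + 1);
    if (lo >= 0xDC00 && lo <= 0xDFFF) {
      return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
    }
  }
  // Not a valid pair: return the single code unit value.
  return hi;
}

var codePoint = codePointAtIndex(tetragram, 0);

// Under ES5 semantics "." matches one code unit, so a single "." cannot
// match the whole tetragram, while two can.
var oneDotMatches = /^.$/.test(tetragram);
var twoDotsMatch = /^..$/.test(tetragram);
```

Here unitCount is 2 and codePoint is 0x1D306, which is the mismatch between 
code units and code points that the proposals try to address.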

The term "code point" is defined in clause 6 of ES5 (in a way that's slightly 
incompatible with the Unicode definition), but the only normative use is in 
relation to URI mappings in subclause 15.1.3, never in relation to source code.
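
For comparison, a small sketch of the one place where code points do matter in 
ES5, the URI functions of 15.1.3. It uses only the standard encodeURIComponent; 
the variable names are my own.

```javascript
// U+1D306 as a surrogate pair in a string value.
var tetragram = "\uD834\uDF06";

// The URI encoding combines the surrogate pair into one code point and
// emits its four-byte UTF-8 form.
var encoded = encodeURIComponent(tetragram);

// An unpaired surrogate has no code point to map to, so encoding fails
// with a URIError.
var loneSurrogateRejected = false;
try {
  encodeURIComponent("\uD834");
} catch (e) {
  loneSurrogateRejected = e instanceof URIError;
}
```

With this input, encoded is "%F0%9D%8C%86", the UTF-8 bytes of a single code 
point rather than two separately encoded code units.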

Allen, Brendan, and I have proposed several ways to move to code point 
semantics in ES6, with each proposal representing a different trade-off between 
compatibility with existing code and ease of future development.

Norbert



On Mar 24, 2012, at 13:11, Wes Garland wrote:

> On 24 March 2012 15:25, David Herman <[email protected]> wrote:
> > Presumably the JS source, as a sequence of UTF-16 code units, represents 
> > the tetragram code points as surrogate pairs.
> 
> Clarification: the JS source *of the regexp literal*.
> 
> 
> We certainly can, although this means that certain Unicode Strings cannot be 
> matched by a regexp with this flag. These strings would be the ones 
> containing reserved code points.
> 
> That said, why is the JS source suddenly a sequence of UTF-16 code units? I 
> believe JS source code should be a sequence of Unicode code points (and I 
> think ES5 says something to this effect).
> 
> The underlying transport format should not be a concern for the JS lexer.  
> The lexer should receive a series of code points from the network transport, 
> allowing web sites to transmit JS in whatever encoding they see fit, provided 
> the browser and server can both agree on it.  I think UTF-8 would make a fine 
> transport format for JS source code.  IMHO the transport format between the 
> browser and the JS lexer [i.e. the input program encoding] should be allowed 
> to be implementation-defined and not specified by TC-39.
> 
> Wes
> 
> -- 
> Wesley W. Garland
> Director, Product Development
> PageMail, Inc.
> +1 613 542 2787 x 102
> _______________________________________________
> es-discuss mailing list
> [email protected]
> https://mail.mozilla.org/listinfo/es-discuss
