Re: Full Unicode based on UTF-16 proposal

Gavin Barraclough Mon, 26 Mar 2012 13:07:39 -0700

Hi Norbert,

I really like the direction you're going in, but have one minor concern 
relating to regular expressions.

In your proposal, you currently state:
        "A code unit that is in the range 0xD800 to 0xDFFF, but is not part of 
a surrogate pair, is interpreted as a code point with the same value."

I think this makes sense in the context of your original proposal, which seeks 
to be backwards compatible with existing regular expressions through the range 
transformations.  But I'm concerned that this might prove problematic, and 
would suggest that if we're going to make unicode regexp matching opt-in 
through a /u flag, it may be better to make unpaired surrogates in unicode 
expressions a syntax error instead.

My concern would be expressions such as:
        /[\uD800\uDC00\uDC00\uD800]/u
Under my reading of the current proposal, this could match any of 
"\uD800\uDC00", "\uD800", or "\uDC00".  Allowing this seems to introduce the 
concept of precedence to character classes (given an input "\uD800\uDC00", 
should I choose to match "\uD800\uDC00" or "\uD800"?).  It may also 
significantly complicate the implementation of backtracking if we were to allow 
this (if I have matched "\uD800\uDC00", should I step back by one code unit or 
two?).
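
To make the ambiguity concrete, here is a rough sketch of the quoted decoding 
rule applied to both the class contents and the input.  It is only an 
illustration, not text from the proposal, and the helper name toCodePoints is 
made up for this example:

```javascript
// Hypothetical helper: decode a UTF-16 string into code points, treating an
// unpaired surrogate as a code point with the same value (the quoted rule).
function toCodePoints(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    const hi = str.charCodeAt(i);
    const lo = i + 1 < str.length ? str.charCodeAt(i + 1) : NaN;
    if (hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF) {
      // Well-formed surrogate pair: combine into one supplementary code point.
      out.push((hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000);
      i++; // the pair consumed two code units
    } else {
      // BMP character or unpaired surrogate: keep the code unit value.
      out.push(hi);
    }
  }
  return out;
}

// Contents of the class in /[\uD800\uDC00\uDC00\uD800]/u
const classMembers = toCodePoints("\uD800\uDC00\uDC00\uD800");
// classMembers is [0x10000, 0xDC00, 0xD800]

// The input from the example above
const input = toCodePoints("\uD800\uDC00");
// input is [0x10000]
```

A matcher that works purely on code points sees a single input character 
here.  A matcher that may also fall back to code units sees "\uD800" as a 
second candidate at the same position, which is where the precedence and 
backtracking questions come from.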

It also just seems much clearer from a user perspective to say that non-unicode 
regular expressions match code units, unicode regular expressions match code 
points - mixing the two seems unhelpful.
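
As a small illustration of the code unit side of that split, this is how 
existing (non-unicode) regular expressions treat a supplementary character.  
The variable names are only for the example:

```javascript
// U+10000 is one code point, stored as two UTF-16 code units.
const s = "\uD800\uDC00";

// Non-unicode regular expressions match one code unit per "." ...
const matchesAsOneChar = /^.$/.test(s);   // false
// ... so the string looks like two characters to them.
const matchesAsTwoChars = /^..$/.test(s); // true
```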

If opt-in is automatic in modules, programmers will likely want an escape to be 
able to write non-unicode regular expressions, but I don't think this warrants 
an extra flag.  I don't think we can automatically change the behaviour of the 
RegExp constructor (without a "u" flag being passed), so RegExp("\uD800") 
should still be available to support non-unicode matching within modules.
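
For reference, a minimal sketch of the constructor behaviour I'd expect to 
stay unchanged (the variable names are just for illustration):

```javascript
// No "u" flag is passed, so the pattern matches by code unit, as it does today.
const loneHigh = new RegExp("\uD800");

// The lone surrogate matches the first half of a surrogate pair.
const matchesInsidePair = loneHigh.test("\uD800\uDC00"); // true

// The match position is reported in code units.
const matchIndex = "a\uD800\uDC00".search(loneHigh); // 1
```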

cheers,
G.

On Mar 16, 2012, at 12:18 AM, Norbert Lindenberg wrote:

> Based on my prioritization of goals for support for full Unicode in 
> ECMAScript [1], I've put together a proposal for supporting the full Unicode 
> character set based on the existing representation of text in ECMAScript 
> using UTF-16 code unit sequences:
> http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/index.html
> 
> The detailed proposed spec changes serve to get a good idea of the scope of 
> the changes, but will need some polishing.
> 
> Comments?
> 
> Thanks,
> Norbert
> 
> [1] https://mail.mozilla.org/pipermail/es-discuss/2012-February/020721.html
> 
> _______________________________________________
> es-discuss mailing list
> [email protected]
> https://mail.mozilla.org/listinfo/es-discuss
