Hi Norbert,
I really like the direction you're going in, but I have one minor concern
relating to regular expressions.
In your proposal, you currently state:
"A code unit that is in the range 0xD800 to 0xDFFF, but is not part of
a surrogate pair, is interpreted as a code point with the same value."
I think this makes sense in the context of your original proposal, which seeks
to be backwards compatible with existing regular expressions through the range
transformations. But I'm concerned that this might prove problematic. If we're
going to make Unicode regexp matching opt-in through a /u flag, I would suggest
that it may be better to make unpaired surrogates in Unicode expressions a
syntax error instead.
My concern would be expressions such as:
/[\uD800\uDC00\uDC00\uD800]/u
Under my reading of the current proposal, this could match any of
"\uD800\uDC00", "\uD800", or "\uDC00". Allowing this seems to introduce the
concept of precedence to character classes (given an input "\uD800\uDC00",
should I choose to match "\uD800\uDC00" or "\uD800"?). It may also
significantly complicate the implementation of backtracking if we were to allow
this (if I have matched "\uD800\uDC00", should I step back by one code unit or
two?).
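To make the ambiguity concrete, here is a minimal sketch in JavaScript. It does
not use any real /u implementation; `classMembers` and `candidatesAt` are
hypothetical names that model the class above as a list of alternatives, under
my reading of the proposal.

```javascript
// Hypothetical model of /[\uD800\uDC00\uDC00\uD800]/u under the reading above:
// one surrogate pair (U+10000) plus a lone low and a lone high surrogate.
const classMembers = ["\uD800\uDC00", "\uDC00", "\uD800"];

// List every class member that could match at `index`, together with the
// number of code units it would consume.
function candidatesAt(input, index) {
  return classMembers
    .filter((member) => input.startsWith(member, index))
    .map((member) => ({
      consumed: member.length,
      // Index by code unit; iterating the string would merge the pair.
      units: Array.from({ length: member.length }, (_, i) =>
        member.charCodeAt(i).toString(16).toUpperCase()),
    }));
}

const candidates = candidatesAt("\uD800\uDC00", 0);
console.log(candidates);
```

Two candidates come back for the same position, one consuming two code units
and one consuming a single code unit, which is where the precedence and
backtracking questions arise.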
It also just seems much clearer from a user perspective to say that non-Unicode
regular expressions match code units and Unicode regular expressions match code
points; mixing the two seems unhelpful.
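The two views can be illustrated with a plain string. This sketch assumes an
engine that provides String.prototype.codePointAt and code-point string
iteration, neither of which is part of the proposal under discussion; `text`,
`codeUnits` and `codePoints` are illustrative names only.

```javascript
// U+10000 written as a surrogate pair, followed by "a".
const text = "\uD800\uDC00a";

// Code-unit view: what non-Unicode regular expressions operate on.
const codeUnits = [];
for (let i = 0; i < text.length; i++) {
  codeUnits.push(text.charCodeAt(i));
}

// Code-point view: what Unicode regular expressions would operate on.
// Assumes string iteration walks the string by code point.
const codePoints = Array.from(text, (ch) => ch.codePointAt(0));

console.log(codeUnits.length);  // 3
console.log(codePoints.length); // 2
```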
If opt-in is automatic in modules, programmers will likely want an escape hatch
to write non-Unicode regular expressions, but I don't think this warrants an
extra flag. I don't think we can automatically change the behaviour of the
RegExp constructor (without a "u" flag being passed), so RegExp("\uD800")
should still be available to support non-Unicode matching within modules.
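As a sketch of that escape hatch: a pattern built with the RegExp constructor
and no "u" flag matches individual code units, so it can find a surrogate value
even when it is the first half of a pair. Only the flag-less behaviour is shown
here; `highSurrogate` and `match` are illustrative names.

```javascript
// Built without a "u" flag, so matching is by code unit.
const highSurrogate = new RegExp("\uD800");

// The input is the surrogate pair for U+10000; the pattern matches its
// first code unit on its own.
const match = highSurrogate.exec("\uD800\uDC00");

console.log(match !== null);  // true
console.log(match.index);     // 0
console.log(match[0].length); // 1
```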
cheers,
G.
On Mar 16, 2012, at 12:18 AM, Norbert Lindenberg wrote:
> Based on my prioritization of goals for support for full Unicode in
> ECMAScript [1], I've put together a proposal for supporting the full Unicode
> character set based on the existing representation of text in ECMAScript
> using UTF-16 code unit sequences:
> http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/index.html
>
> The detailed proposed spec changes serve to get a good idea of the scope of
> the changes, but will need some polishing.
>
> Comments?
>
> Thanks,
> Norbert
>
> [1] https://mail.mozilla.org/pipermail/es-discuss/2012-February/020721.html
>
> _______________________________________________
> es-discuss mailing list
> [email protected]
> https://mail.mozilla.org/listinfo/es-discuss