2012/3/26 Gavin Barraclough <[email protected]>:
> Hi Norbert,
>
> I really like the direction you're going in, but have one minor concern
> relating to regular expressions.
>
> In your proposal, you currently state:
>     "A code unit that is in the range 0xD800 to 0xDFFF, but is not part of
>     a surrogate pair, is interpreted as a code point with the same value."
>
> I think this makes sense in the context of your original proposal, which
> seeks to be backwards compatible with existing regular expressions through
> the range transformations. But I'm concerned that this might prove
> problematic, and would suggest that if we're going to make unicode regexp
> match opt-in through a /u flag then instead it may be better to make unpaired
> surrogates in unicode expressions a syntax error.
>
> My concern would be expressions such as:
>     /[\uD800\uDC00\uDC00\uD800]/u
> Under my reading of the current proposal, this could match any of
> "\uD800\uDC00", "\uD800", or "\uDC00". Allowing this seems to introduce the
> concept of precedence to character classes (given an input "\uD800\uDC00",
> should I choose to match "\uD800\uDC00" or "\uD800"?). It may also
> significantly complicate the implementation of backtracking if we were to
> allow this (if I have matched "\uD800\uDC00", should I step back by one code
> unit or two?).
>
> It also just seems much clearer from a user perspective to say that
> non-unicode regular expressions match code units, unicode regular expressions
> match code points - mixing the two seems unhelpful.
>
> If opt-in is automatic in modules, programmers will likely want an escape to
> be able to write non-unicode regular expressions, but I don't think this
> should warrant an extra flag, I don't think we can automatically change the
> behaviour of the RegExp constructor (without a "u" flag being passed), so
> RegExp("\uD800") should still be available to support non-unicode matching
> within modules.

This is too nasty. The regexp constructor should not have to look up the
stack to see what behaviour is expected of it.

-- 
Erik Corry

>
> cheers,
> G.
>
>
> On Mar 16, 2012, at 12:18 AM, Norbert Lindenberg wrote:
>
>> Based on my prioritization of goals for support for full Unicode in
>> ECMAScript [1], I've put together a proposal for supporting the full Unicode
>> character set based on the existing representation of text in ECMAScript
>> using UTF-16 code unit sequences:
>> http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/index.html
>>
>> The detailed proposed spec changes serve to get a good idea of the scope of
>> the changes, but will need some polishing.
>>
>> Comments?
>>
>> Thanks,
>> Norbert
>>
>> [1] https://mail.mozilla.org/pipermail/es-discuss/2012-February/020721.html
>>
>> _______________________________________________
>> es-discuss mailing list
>> [email protected]
>> https://mail.mozilla.org/listinfo/es-discuss
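
For reference, the rule quoted from the proposal in this thread ("A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a surrogate pair, is interpreted as a code point with the same value") can be sketched roughly as below. This is only an illustration of that one sentence, not the proposal's spec text; the helper name `toCodePoints` is made up for this example.

```javascript
// Hypothetical helper (name not from the proposal): decode a UTF-16 string
// into an array of code points. A lead surrogate followed by a trail
// surrogate becomes one supplementary code point; any other code unit,
// including an unpaired surrogate, is kept as a code point of the same value.
function toCodePoints(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const next = i + 1 < str.length ? str.charCodeAt(i + 1) : -1;
    const isLead = unit >= 0xD800 && unit <= 0xDBFF;
    const nextIsTrail = next >= 0xDC00 && next <= 0xDFFF;
    if (isLead && nextIsTrail) {
      // Standard UTF-16 surrogate pair arithmetic.
      out.push((unit - 0xD800) * 0x400 + (next - 0xDC00) + 0x10000);
      i++; // the trail surrogate has been consumed
    } else {
      out.push(unit);
    }
  }
  return out;
}

// The class contents from /[\uD800\uDC00\uDC00\uD800]/u, read as code points:
const classMembers = toCodePoints("\uD800\uDC00\uDC00\uD800");
// -> [0x10000, 0xDC00, 0xD800]
```

Under this reading the class has three members: U+10000 plus the two unpaired surrogates. The input "\uD800\uDC00" decodes to the single code point U+10000, which is the case where Gavin asks whether a matcher should consume the pair or only "\uD800".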

