Re: Questions regarding ES6 Unicode regular expressions

Mathias Bynens Tue, 26 Aug 2014 11:16:00 -0700

On 26 Aug 2014, at 19:01, Allen Wirfs-Brock wrote:


> I've thought about this a bit. I was initially inclined to agree with the 
> idea of extending the existing character classes similar to what Mathias' 
> proposes.  But I now think that is probably not a very good idea and that 
> what is currently spec'ed (essentially that the /u flag doesn't change the 
> meaning of \w, \d, etc.) is the better path. […] It seems to me, that we want 
> programmers to start migrating to full Unicode regular expressions without 
> having to do a major logic rewrite of their code.  For example, ideally the 
> above expression could simply be replaced by 
> `parseInt(/\s*(\d+)/u.exec(input)[1])` and everything in the application 
> could continue to work unchanged.

I see your point, but I disagree with the notion that we must absolutely 
maintain backwards compatibility in this case. The fact that the new flag is 
opt-in gives us an opportunity to improve behavior without obsessing about 
back-compat, similar to how the strict mode opt-in is used to make all sorts of 
things better. When [evangelizing 
`/u`](https://mathiasbynens.be/notes/es6-unicode-regex), we can educate 
developers and tell them to not blindly/needlessly add `/u` to their existing 
regular expressions.
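
For reference, here is a minimal sketch of the behavior as currently spec'ed, assuming an engine that implements the ES6 `u` flag. The `input` value is made up for illustration and does not come from the thread.

```javascript
// Sketch: assumes an engine that implements the ES6 `u` flag.
// `input` is a made-up sample value, not taken from the thread.
const input = '  42 apples';

// As currently spec'ed, adding `u` leaves \s and \d untouched,
// so both expressions extract the same number.
const withoutU = parseInt(/\s*(\d+)/.exec(input)[1], 10);
const withU = parseInt(/\s*(\d+)/u.exec(input)[1], 10);

// \d stays equivalent to [0-9] with or without `u`, so
// ARABIC-INDIC DIGIT ONE (U+0661) is not matched either way.
const matchesWithoutU = /\d/.test('\u0661');
const matchesWithU = /\d/u.test('\u0661');

console.log(withoutU, withU, matchesWithoutU, matchesWithU);
```

Under my proposal the last check would change for `/u` patterns, which is exactly the back-compat trade-off being discussed.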

> Instead, we should leave the definitions of \d, \w and \s unchanged and plan 
> to adopt the already established convention that `\p{<Unicode property>}` is 
> the notation for matching Unicode categories. See 
> http://www.regular-expressions.info/unicode.html 

We could do both: improve `\d` and `\w` now, and add `\p{property}` and 
`\P{property}` later. Anyhow, I’ve filed 
https://bugs.ecmascript.org/show_bug.cgi?id=3157 for reserving `\p{…}`/`\P{…}`.

> I think digesting all the \p{} possibilities is too much to do for ES6, so I 
> suggest that for ES6 we simply reserve the \p{<characters>} and 
> \P{<characters>} syntax within /u patterns.  A \p proposal can then be 
> developed for ES7.

Sounds good to me.
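
A rough sketch of what the reservation means in practice, assuming an engine that implements the ES6 `u` flag: a bare `\p` is an identity escape in legacy patterns, but `/u` patterns reject identity escapes for non-syntax characters, which keeps the syntax free for later use. The `compiles` helper is hypothetical and exists only for this illustration.

```javascript
// Hypothetical helper: reports whether a pattern compiles with the given flags.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error; // anything else is unexpected
  }
}

// Without `u`, \p is accepted as an identity escape and simply matches "p".
const legacyCompiles = compiles('\\p', '');
const legacyMatchesP = new RegExp('\\p').test('p');

// With `u`, a bare \p is a syntax error instead of silently matching "p".
const unicodeCompiles = compiles('\\p', 'u');

console.log(legacyCompiles, legacyMatchesP, unicodeCompiles);
```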

> I see one remaining issue:
> In ES5 (and ES6): `/[a-z]/i`  does not match U+017F (ſ) or U+212A (K) because 
> the ES canonicalization algorithm excludes mapping code points > 127 that 
> toUpperCase to code points <128.
> However, as currently spec'ed, the ES6 canonicalization algorithm for /u 
> RegExps does not include that >127/<128 exclusion.  It maps U+017F to "S" 
> which matches. 
> This is probably a minor variation from the ES5 behavior, but we should 
> probably be sure it is a desirable and tolerable change as we presumably 
> could also apply the >127/<128 filter to /u canonicalization.

Matching those characters case-insensitively is a useful feature, and the 
explicit opt-in makes the small back-compat break acceptable IMHO.
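
To illustrate the difference, a small sketch assuming an engine that implements the ES6 `u` flag with the canonicalization as currently spec'ed (no >127/<128 exclusion for `/u` patterns):

```javascript
// Sketch: assumes an engine that implements the ES6 `u` flag.
const longS = '\u017F';  // LATIN SMALL LETTER LONG S
const kelvin = '\u212A'; // KELVIN SIGN

// Without `u`, canonicalization refuses to map a code point above 127
// to one below 128, so neither character matches an ASCII class.
const longSLegacy = /[a-z]/i.test(longS);
const kelvinLegacy = /[a-z]/i.test(kelvin);

// With `u`, case folding is applied without that exclusion:
// U+017F folds to "s" and U+212A folds to "k", so both match.
const longSUnicode = /[a-z]/iu.test(longS);
const kelvinUnicode = /[a-z]/iu.test(kelvin);

console.log(longSLegacy, kelvinLegacy, longSUnicode, kelvinUnicode);
```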
