I've thought about this a bit. I was initially inclined to agree with the idea
of extending the existing character classes, similar to what Mathias proposes.
But I now think that is probably not a very good idea and that what is
currently spec'ed (essentially that the /u flag doesn't change the meaning of
\w, \d, etc.) is the better path.
The basic issue I see is backwards compatibility and evolving code to using /u
patterns.
I suspect that there is plenty of JS code in the world that does something more
or less equivalent to `parseInt(/\s*(\d+)/.exec(input)[1])`
Note that `parseInt` is only prepared to recognize the digits U+0030-U+0039.
It seems to me that we want programmers to start migrating to full Unicode
regular expressions without having to do a major logic rewrite of their code.
For example, ideally the above expression could simply be replaced by
`parseInt(/\s*(\d+)/u.exec(input)[1])` and everything in the application could
continue to work unchanged.
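To make the migration hazard concrete, here is a sketch of that scenario (the input string is made up for illustration). Because `parseInt` only understands the ASCII digits, code of this shape would silently break if `/\d/u` matched all Unicode decimal digits:

```javascript
// Today, \d matches only U+0030-U+0039, so the captured text is
// always something parseInt can consume:
const input = "  42 apples"; // hypothetical input
const n = parseInt(/\s*(\d+)/.exec(input)[1]);
console.log(n); // 42

// If \d under /u were broadened to all Unicode decimal digits, a match
// like "42" written in Arabic-Indic digits would reach parseInt and
// produce NaN instead of a number:
console.log(parseInt("\u0664\u0662")); // NaN
```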
That won't be the case if we redefine, as Mathias proposes, `/\d/u` to be
equivalent to
`/[0-9\u0660-\u0669\u06F0-\u06F9\u07C0-\u07C9\u0966-\u096F\u09E6-\u09EF\u0A66-\u0A6F\u0AE6-\u0AEF\u0B66-\u0B6F\u0BE6-\u0BEF\u0C66-\u0C6F\u0CE6-\u0CEF\u0D66-\u0D6F\u0DE6-\u0DEF\u0E50-\u0E59\u0ED0-\u0ED9\u0F20-\u0F29\u1040-\u1049\u1090-\u1099\u17E0-\u17E9\u1810-\u1819\u1946-\u194F\u19D0-\u19D9\u1A80-\u1A89\u1A90-\u1A99\u1B50-\u1B59\u1BB0-\u1BB9\u1C40-\u1C49\u1C50-\u1C59\uA620-\uA629\uA8D0-\uA8D9\uA900-\uA909\uA9D0-\uA9D9\uA9F0-\uA9F9\uAA50-\uAA59\uABF0-\uABF9\uFF10-\uFF19]|\uD801[\uDCA0-\uDCA9]|\uD804[\uDC66-\uDC6F\uDCF0-\uDCF9\uDD36-\uDD3F\uDDD0-\uDDD9\uDEF0-\uDEF9]|\uD805[\uDCD0-\uDCD9\uDE50-\uDE59\uDEC0-\uDEC9]|\uD806[\uDCE0-\uDCE9]|\uD81A[\uDE60-\uDE69\uDF50-\uDF59]|\uD835[\uDFCE-\uDFFF]/u`
rather than
`/[0-9]/u`
We can apply similar logic to \w and even \s.
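A quick check of what "leave the definitions unchanged" means in practice, in an engine implementing the currently spec'ed /u behavior:

```javascript
// With /u as currently spec'ed, \d and \w stay ASCII-only:
console.log(/\d/u.test("\u0664")); // false - Arabic-Indic four is not \d
console.log(/\w/u.test("\u00E9")); // false - e-acute is not \w
```

(Note that `\s` already matches some non-ASCII whitespace, such as U+00A0, even without /u, so it is the least affected of the three.)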
Instead, we should leave the definitions of \d, \w and \s unchanged and plan to
adopt the already established convention that `\p{<Unicode property>}` is the
notation for matching Unicode categories. See
http://www.regular-expressions.info/unicode.html
I think digesting all the \p{} possibilities is too much to do for ES6, so I
suggest that, for ES6, we simply reserve the \p{<characters>} and
\P{<characters>} syntax within /u patterns. A \p proposal can then be
developed for ES7.
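For illustration, here is what the \p{} convention described at regular-expressions.info looks like in practice. This syntax is not part of ES6 and would only be reserved there; treat the example as a sketch of what a future proposal might provide, runnable only in engines that support Unicode property escapes:

```javascript
// Hypothetical future behavior: \p{Nd} opts in to all Unicode decimal
// digits, while \d keeps its ASCII-only meaning.
console.log(/^\p{Nd}+$/u.test("\u0664\u0662")); // true  - Arabic-Indic "42"
console.log(/^\d+$/u.test("\u0664\u0662"));     // false - \d stays ASCII-only
```

This split lets programmers who actually want full-Unicode classes ask for them explicitly, rather than having them imposed by the /u flag.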
I see one remaining issue:
In ES5 (and ES6): `/[a-z]/i` does not match U+017F (ſ) or U+212A (K) because
the ES canonicalization algorithm excludes mapping code points > 127 that
toUpperCase to code points < 128.
However, as currently spec'ed, the ES6 canonicalization algorithm for /u
RegExps does not include that >127/<128 exclusion: it maps U+017F to "S",
which matches.
This is probably a minor variation from the ES5 behavior, but we should make
sure it is a desirable and tolerable change, since we could presumably also
apply the >127/<128 filter to /u canonicalization instead.
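The difference is observable directly, in an engine implementing the currently spec'ed /u canonicalization:

```javascript
// ES5-style /i canonicalization excludes >127 -> <128 mappings:
console.log(/[a-z]/i.test("\u017F"));  // false - s (long s) does not match
// /u canonicalization, as currently spec'ed, has no such exclusion:
console.log(/[a-z]/iu.test("\u017F")); // true  - s folds to s
console.log(/k/iu.test("\u212A"));     // true  - K (Kelvin sign) folds to k
```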
So, here is a summary of my proposal:
1) Don't change the current definitions of \d, \w, \s when used in /u regular
expressions.
2) Decide whether the current ES6 /u canonicalization algorithm is correct, or
whether it should not translate code points > 127 that map to code points < 128.
3) Reserve, within /u RegExp patterns, the syntax \p{<characters>} and
\P{<characters>}.
4) Start to develop a \p{ } proposal for ES7.
Allen
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss