I've thought about this a bit. I was initially inclined to agree with the idea
of extending the existing character classes, similar to what Mathias proposes.
But I now think that is probably not a very good idea and that what is
currently spec'ed (essentially that the /u flag doesn't change the meaning of
\w, \d, etc.) is the better path.
The basic issue I see is backwards compatibility and evolving code to using /u
patterns.
I suspect that there is plenty of JS code in the world that does something more
or less equivalent to `parseInt(/\s*(\d+)/.exec(input)[1])`
Note that `parseInt` is only prepared to recognize the digits U+0030-U+0039.
It seems to me that we want programmers to start migrating to full Unicode
regular expressions without having to do a major logic rewrite of their code.
For example, ideally the above expression could simply be replaced by
`parseInt(/\s*(\d+)/u.exec(input)[1])` and everything in the application could
continue to work unchanged.
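To make the migration hazard concrete, here is a sketch of that scenario (the input string is made up for illustration). Because `parseInt` only understands the ASCII digits, code of this shape would silently break if `/\d/u` matched all Unicode decimal digits:

```javascript
// Today, \d matches only U+0030-U+0039, so the captured text is
// always something parseInt can consume:
const input = "  42 apples"; // hypothetical input
const n = parseInt(/\s*(\d+)/.exec(input)[1]);
console.log(n); // 42

// If \d under /u were broadened to all Unicode decimal digits, a match
// like "42" written in Arabic-Indic digits would reach parseInt and
// produce NaN instead of a number:
console.log(parseInt("\u0664\u0662")); // NaN
```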
That won't be the case if we redefine, as Mathias proposes, `/\d/u` to be
equivalent to
`/[0-9\u0660-\u0669\u06F0-\u06F9\u07C0-\u07C9\u0966-\u096F\u09E6-\u09EF\u0A66-\u0A6F\u0AE6-\u0AEF\u0B66-\u0B6F\u0BE6-\u0BEF\u0C66-\u0C6F\u0CE6-\u0CEF\u0D66-\u0D6F\u0DE6-\u0DEF\u0E50-\u0E59\u0ED0-\u0ED9\u0F20-\u0F29\u1040-\u1049\u1090-\u1099\u17E0-\u17E9\u1810-\u1819\u1946-\u194F\u19D0-\u19D9\u1A80-\u1A89\u1A90-\u1A99\u1B50-\u1B59\u1BB0-\u1BB9\u1C40-\u1C49\u1C50-\u1C59\uA620-\uA629\uA8D0-\uA8D9\uA900-\uA909\uA9D0-\uA9D9\uA9F0-\uA9F9\uAA50-\uAA59\uABF0-\uABF9\uFF10-\uFF19]|\uD801[\uDCA0-\uDCA9]|\uD804[\uDC66-\uDC6F\uDCF0-\uDCF9\uDD36-\uDD3F\uDDD0-\uDDD9\uDEF0-\uDEF9]|\uD805[\uDCD0-\uDCD9\uDE50-\uDE59\uDEC0-\uDEC9]|\uD806[\uDCE0-\uDCE9]|\uD81A[\uDE60-\uDE69\uDF50-\uDF59]|\uD835[\uDFCE-\uDFFF]/u`
rather than
`/[0-9]/u`
We can apply similar logic to \w and even \s.
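A quick check of what "leave the definitions unchanged" means in practice, in an engine implementing the currently spec'ed /u behavior:

```javascript
// With /u as currently spec'ed, \d and \w stay ASCII-only:
console.log(/\d/u.test("\u0664")); // false - Arabic-Indic four is not \d
console.log(/\w/u.test("\u00E9")); // false - e-acute is not \w
```

(Note that `\s` already matches some non-ASCII whitespace, such as U+00A0, even without /u, so it is the least affected of the three.)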
Instead, we should leave the definitions of \d, \w and \s unchanged and plan to
adopt the already established convention that `\p{<Unicode property>}` is the
notation for matching Unicode categories. See
http://www.regular-expressions.info/unicode.html
I think digesting all the \p{} possibilities is too much to do for ES6, so I
suggest that, for ES6, we simply reserve the \p{<characters>} and
\P{<characters>} syntax within /u patterns. A \p proposal can then be
developed for ES7.
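For illustration, here is what the \p{} convention described at regular-expressions.info looks like in practice. This syntax is not part of ES6 and would only be reserved there; treat the example as a sketch of what a future proposal might provide, runnable only in engines that support Unicode property escapes:

```javascript
// Hypothetical future behavior: \p{Nd} opts in to all Unicode decimal
// digits, while \d keeps its ASCII-only meaning.
console.log(/^\p{Nd}+$/u.test("\u0664\u0662")); // true  - Arabic-Indic "42"
console.log(/^\d+$/u.test("\u0664\u0662"));     // false - \d stays ASCII-only
```

This split lets programmers who actually want full-Unicode classes ask for them explicitly, rather than having them imposed by the /u flag.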
I see one remaining issue:
In ES5 (and ES6): `/[a-z]/i` does not match U+017F (ſ) or U+212A (K) because
the ES canonicalization algorithm excludes mapping code points > 127 that
toUpperCase to code points < 128.
However, as currently spec'ed, the ES6 canonicalization algorithm for /u
RegExps does not include that >127/<128 exclusion: it maps U+017F to "S",
which matches.
This is probably a minor variation from the ES5 behavior, but we should make
sure it is a desirable and tolerable change, since we could presumably also
apply the >127/<128 filter to /u canonicalization instead.
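The difference is observable directly, in an engine implementing the currently spec'ed /u canonicalization:

```javascript
// ES5-style /i canonicalization excludes >127 -> <128 mappings:
console.log(/[a-z]/i.test("\u017F"));  // false - s (long s) does not match
// /u canonicalization, as currently spec'ed, has no such exclusion:
console.log(/[a-z]/iu.test("\u017F")); // true  - s folds to s
console.log(/k/iu.test("\u212A"));     // true  - K (Kelvin sign) folds to k
```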
So, here is a summary of my proposal:
1) Don't change the current definitions of \d, \w, \s when used in /u regular
expressions.
2) Decide whether the current ES6 /u canonicalization algorithm is correct, or
whether it should not translate code points > 127 that map to code points < 128.
3) Reserve, within /u RegExp patterns, the syntax \p{<characters>} and
\P{<characters>}.
4) Start to develop a \p{ } proposal for ES7.
Allen
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss