Norbert Lindenberg wrote:

I've updated the proposal based on the feedback received so far. Changes
are listed in the Updates section.
http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/

Cool.

From the proposal's Updates section:

Indicated that "u" may not be the actual character for the flag for code
point mode in regular expressions, as a "u" flag has already been proposed
for Unicode-aware digit and word character matching.

I've been wondering whether it might be best for the /u flag to do three things at once, making it an all-around "support Unicode better" flag:

1. Switches from code unit to code point mode. /./gu matches any Unicode code point, among other benefits outlined by Norbert.

2. Makes \d\D\w\W\b\B match Unicode decimal digits and word characters. [0-9], [A-Za-z0-9_], and lookaround provide fallbacks if you want to match ASCII characters only while using /u.

3. [New proposal] Makes /i use Unicode casefolding rules. /ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.
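
For illustration, here is a minimal sketch of items 1 and 3. It assumes an engine that implements the u flag as it later shipped in ES2015, which may differ in detail from what is proposed in this thread. The variable names are made up for the example. Item 2 is shown only as a contrast: as far as I know, shipped engines kept \d ASCII-only even under the flag.

```javascript
// Illustration only: behavior of the `u` flag as later shipped in ES2015
// engines, which may differ from the proposal discussed in this thread.
const pile = "\uD83D\uDCA9"; // U+1F4A9: one code point, two code units

// Item 1: code unit mode vs. code point mode.
const unitMatches = pile.match(/./g);   // matches each surrogate separately
const pointMatches = pile.match(/./gu); // matches the whole code point once

// Item 3: /i combined with the flag uses Unicode case folding, so both
// capital sigma and final sigma fold to the same character.
const foldsSigma = /ΣΤΙΓΜΑΣ/iu.test("στιγμας");

// Item 2 (contrast): in shipped engines \d stays [0-9] even with the flag,
// so an Arabic-Indic digit (U+0663) is not matched.
const matchesArabicDigit = /\d/u.test("\u0663");

console.log(unitMatches.length, pointMatches.length, foldsSigma, matchesArabicDigit);
// prints: 2 1 true false
```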

Item number 3 is inspired by Java's lowercase u flag (UNICODE_CASE) for Unicode casefolding. In Java, flag u changes matching only when it is paired with flag i (which is equivalent to ES's /i), so the /iu pairing proposed here would work the same way.

As an aside, merging these three things would likely lead to /u seeing widespread use when dealing with anything more than ASCII, at least in environments where you don't have to worry about backcompat. This would help developers avoid stumbling on code unit issues in the small minority of cases where non-BMP characters are used or encountered. If /u's only purpose were to switch to code point mode, most likely it would be used *far* less often, and more developers would continue to get bitten by code-unit-based processing.

As for whether the switch to code-point-based matching should be universal or require /u (an issue that your proposal leaves open), IMHO it's better to require /u since it avoids the need for transforming \uxxxx[\uyyyy-\uzzzz] to [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}] and [\uwwww-\uxxxx][\uDC00-\uDFFF] to [{\uwwww\uDC00}-{\uxxxx\uDFFF}], and additionally avoids at least three potentially breaking changes (two of which are explicitly mentioned in your proposal):

1. "[S]ome applications might have processed gunk with regular expressions where neither the 'characters' in the patterns nor the input to be matched are text."

2. "s.match(/^.$/)[0].length can now be 2."
I'll add, /.{3}/.exec(s)[0].length can now be anywhere between 3 and 6.

3. /./g.exec(s) can now increment the regex's lastIndex by 2.
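
Changes 2 and 3 can be observed with a small sketch. It uses the u flag as later shipped in ES2015 engines as a stand-in for code point mode, which is an assumption on my part and not necessarily what the proposal would specify. The sample string s is made up for the example.

```javascript
// Illustration only: `u`-flag behavior as later shipped in ES2015 engines,
// used here as a stand-in for universal code point mode.
const astral = "\uD835\uDC9C";   // U+1D49C: one code point, two code units
const s = astral + "b" + astral; // 3 code points, 5 code units

// Change 2: a single "." can yield a match of length 2.
const oneDot = astral.match(/^.$/u)[0].length;

// Three "."s can cover anywhere from 3 to 6 code units; here it is 5.
const threeDots = /.{3}/u.exec(s)[0].length;

// Change 3: matching a supplementary character advances lastIndex by 2.
const codePointRe = /./gu;
codePointRe.exec(s);
const afterFirst = codePointRe.lastIndex;

// In code unit mode the same pattern consumes only the first surrogate.
const codeUnitRe = /./g;
codeUnitRe.exec(s);
const legacyAfterFirst = codeUnitRe.lastIndex;

console.log(oneDot, threeDots, afterFirst, legacyAfterFirst);
// prints: 2 5 2 1
```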

-- Steven Levithan


_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
