Norbert Lindenberg wrote:
I've updated the proposal based on the feedback received so far. Changes
are listed in the Updates section.
http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/
Cool.
From the proposal's Updates section:
Indicated that "u" may not be the actual character for the flag for code
point mode in regular expressions, as a "u" flag has already been proposed
for Unicode-aware digit and word character matching.
I've been wondering whether it might be best for the /u flag to do three
things at once, making it an all-around "support Unicode better" flag:
1. Switches from code unit to code point mode. /./gu matches any Unicode
code point, among other benefits outlined by Norbert.
2. Makes \d\D\w\W\b\B match Unicode decimal digits and word characters.
[0-9], [A-Za-z0-9_], and lookaround provide fallbacks if you want to match
ASCII characters only while using /u.
3. [New proposal] Makes /i use Unicode casefolding rules.
/ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.
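To make the three items concrete, here is a rough sketch. It assumes an
engine with a u flag along the lines of what was later standardized in
ECMAScript 2015, which is not necessarily what this proposal would end up
specifying. The helper name countDotMatches and the sample strings are made
up for illustration.

```javascript
// Assumption: the engine supports a `u` flag with ES2015-style semantics.
// U+1D306 is a non-BMP character stored as two UTF-16 code units.
const astral = "\uD834\uDF06";

// Item 1: count how many times "." matches under the given flags.
// (countDotMatches is a hypothetical helper, not part of any proposal.)
function countDotMatches(str, flags) {
  const matches = str.match(new RegExp(".", flags));
  return matches === null ? 0 : matches.length;
}

const codeUnitCount = countDotMatches(astral, "g");   // 2: one per surrogate
const codePointCount = countDotMatches(astral, "gu"); // 1: one per code point

// Item 2: an explicit class stays ASCII-only even when /u is present.
// Whether \d would match these digits under /u depends on the semantics
// that are finally adopted, so only the fallback is shown here.
const arabicIndicDigits = "\u0661\u0662\u0663";
const asciiClassMatchesAscii = /^[0-9]+$/u.test("123");
const asciiClassMatchesArabicIndic = /^[0-9]+$/u.test(arabicIndicDigits);

// Item 3: /i alone uppercases each character, so KELVIN SIGN (U+212A)
// does not match "k". With Unicode case folding (/iu) it does.
const kelvinMatchesWithoutU = /\u212A/i.test("k");
const kelvinMatchesWithU = /\u212A/iu.test("k");

console.log({
  codeUnitCount,
  codePointCount,
  asciiClassMatchesAscii,
  asciiClassMatchesArabicIndic,
  kelvinMatchesWithoutU,
  kelvinMatchesWithU,
});
```

The Kelvin sign is used for item 3 because it is a case where /i with and
without Unicode case folding visibly disagree.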
Item number 3 is inspired by, but different from, Java's lowercase u flag
for Unicode casefolding. In Java, flag u itself enables Unicode casefolding
and does not need to be paired with flag i (which is equivalent to ES's /i).
As an aside, merging these three things would likely lead to /u seeing
widespread use when dealing with anything more than ASCII, at least in
environments where you don't have to worry about backcompat. This would help
developers avoid stumbling on code unit issues in the small minority of
cases where non-BMP characters are used or encountered. If /u's only purpose
were to switch to code point mode, it would most likely be used *far* less
often, and more developers would continue to get bitten by code-unit-based
processing.
As for whether the switch to code-point-based matching should be universal
or require /u (an issue that your proposal leaves open), IMHO it's better to
require /u. That avoids the need to transform \uxxxx[\uyyyy-\uzzzz] into
[{\uxxxx\uyyyy}-{\uxxxx\uzzzz}] and [\uwwww-\uxxxx][\uDC00-\uDFFF] into
[{\uwwww\uDC00}-{\uxxxx\uDFFF}], and it also avoids at least three
potentially breaking changes (two of which are explicitly mentioned in your
proposal):
1. "[S]ome applications might have processed gunk with regular expressions
where neither the 'characters' in the patterns nor the input to be matched
are text."
2. "s.match(/^.$/)[0].length can now be 2."
I'll add, /.{3}/.exec(s)[0].length can now be anywhere between 3 and 6.
3. /./g.exec(s) can now increment the regex's lastIndex by 2.
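The differences above can be observed roughly as follows. This is only a
sketch: it uses an ES2015-style u flag as a stand-in for code point mode, on
the assumption that a universal switch would make the unflagged patterns
behave like the flagged ones. The variable names and sample strings are
invented for the example.

```javascript
// Assumption: ES2015-style `u` flag stands in for "code point mode".
const s = "\uD834\uDF06"; // U+1D306: one code point, two code units

// Change 2: /^.$/ fails on s in code unit mode, but matches the whole
// pair in code point mode, so the matched string has length 2.
const singleDotMatchesCodeUnits = /^.$/.test(s);
const singleDotMatchLength = s.match(/^.$/u)[0].length;

// .{3} over a mix of non-BMP and BMP characters:
// three code points, five code units.
const mixed = s + "a" + s;
const threeLengthCodeUnits = /.{3}/.exec(mixed)[0].length;   // 3
const threeLengthCodePoints = /.{3}/u.exec(mixed)[0].length; // 5

// Change 3: lastIndex after one global match of "." against s.
const reCodeUnits = /./g;
reCodeUnits.exec(s);
const lastIndexCodeUnits = reCodeUnits.lastIndex;   // 1

const reCodePoints = /./gu;
reCodePoints.exec(s);
const lastIndexCodePoints = reCodePoints.lastIndex; // 2

// Patterns written as lead surrogate + trail surrogate class work in code
// unit mode. In code point mode the lone lead surrogate in the pattern no
// longer matches half of a pair, which is why a universal switch would
// need the pattern transformation described above.
const surrogatePatternCodeUnits = /\uD834[\uDC00-\uDFFF]/.test(s);
const surrogatePatternCodePoints = /\uD834[\uDC00-\uDFFF]/u.test(s);

console.log({
  singleDotMatchesCodeUnits,
  singleDotMatchLength,
  threeLengthCodeUnits,
  threeLengthCodePoints,
  lastIndexCodeUnits,
  lastIndexCodePoints,
  surrogatePatternCodeUnits,
  surrogatePatternCodePoints,
});
```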
-- Steven Levithan