Re: Full Unicode based on UTF-16 proposal

Steven Levithan Fri, 23 Mar 2012 06:31:42 -0700

Norbert Lindenberg wrote:

I've updated the proposal based on the feedback received so far. Changes
are listed in the Updates section.
http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/


Cool.

From the proposal's Updates section:

Indicated that "u" may not be the actual character for the flag for code
point mode in regular expressions, as a "u" flag has already been proposed
for Unicode-aware digit and word character matching.

I've been wondering whether it might be best for the /u flag to do three things at once, making it an all-around "support Unicode better" flag:

1. Switches from code unit to code point mode. /./gu matches any Unicode code point, among other benefits outlined by Norbert.

2. Makes \d\D\w\W\b\B match Unicode decimal digits and word characters. [0-9], [A-Za-z0-9_], and lookaround provide fallbacks if you want to match ASCII characters only while using /u.

3. [New proposal] Makes /i use Unicode casefolding rules. /ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.
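To make the three items concrete, here is a rough sketch. It assumes an engine that implements a "u" flag the way one was later standardized in ECMAScript 2015, which is not necessarily what is proposed here: that flag covers items 1 and 3, while \d and \w stay ASCII-only, so item 2 is only shown in its unflagged form. All variable names are made up for illustration.

```javascript
// U+1D4B3 MATHEMATICAL SCRIPT CAPITAL X is outside the BMP, so a JS string
// stores it as two UTF-16 code units (a surrogate pair).
const nonBmp = "\uD835\uDCB3";

// Item 1: "." in code unit mode vs. code point mode.
const dotCodeUnit = /^.$/.test(nonBmp);   // "." sees two code units
const dotCodePoint = /^.$/u.test(nonBmp); // "." sees one code point

// Item 2: without any flag, \d and \w only cover ASCII.
const arabicDigit = /\d/.test("\u0663");        // ARABIC-INDIC DIGIT THREE
const accentedWord = /^\w+$/.test("caf\u00E9"); // "café"

// Item 3: /i combined with /u uses Unicode case folding.
const sigma = /ΣΤΙΓΜΑΣ/iu.test("στιγμας");
const longSWithoutU = /s/i.test("\u017F"); // LATIN SMALL LETTER LONG S
const longSWithU = /s/iu.test("\u017F");

console.log({
  dotCodeUnit,   // false
  dotCodePoint,  // true
  arabicDigit,   // false
  accentedWord,  // false
  sigma,         // true
  longSWithoutU, // false
  longSWithU,    // true
});
```

The long s case is included because it is one where /i alone and /i with /u visibly disagree.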

Item number 3 is inspired by but different than Java's lowercase u flag for Unicode casefolding. In Java, flag u itself enables Unicode casefolding and does not need to be paired with flag i (which is equivalent to ES's /i).

As an aside, merging these three things would likely lead to /u seeing widespread use when dealing with anything more than ASCII, at least in environments where you don't have to worry about backcompat. This would help developers avoid stumbling on code unit issues in the small minority of cases where non-BMP characters are used or encountered. If /u's only purpose was to switch to code point mode, most likely it would be used *far* less often, and more developers would continue to get bitten by code-unit-based processing.

As for whether the switch to code-point-based matching should be universal or require /u (an issue that your proposal leaves open), IMHO it's better to require /u since it avoids the need for transforming \uxxxx[\uyyyy-\uzzzz] to [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}] and [\uwwww-\uxxxx][\uDC00-\uDFFF] to [{\uwwww\uDC00}-{\uxxxx\uDFFF}], and additionally avoids at least three potentially breaking changes (two of which are explicitly mentioned in your proposal):

1. "[S]ome applications might have processed gunk with regular expressions where neither the 'characters' in the patterns nor the input to be matched are text."


2. "s.match(/^.$/)[0].length can now be 2."
I'll add, /.{3}/.exec(s)[0].length can now be anywhere between 3 and 6.

3. /./g.exec(s) can now increment the regex's lastIndex by 2.
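A quick sketch of the surrogate-pair patterns and of changes 2 and 3, for anyone who wants to try them. It assumes an engine with a "u" flag and \u{...} escapes as later standardized in ECMAScript 2015, used here as a stand-in for the proposed code point mode; the sample string and variable names are made up.

```javascript
// Three non-BMP characters (U+1D4B3..U+1D4B5) followed by ASCII text.
const sample = "\uD835\uDCB3\uD835\uDCB4\uD835\uDCB5 abc";
const single = "\uD835\uDCB3";

// A pattern of the form \uxxxx[\uyyyy-\uzzzz], written against code units:
// one fixed high surrogate followed by a range of low surrogates
// (here U+1D400..U+1D419). It keeps working as-is without any flag.
const legacyRange = /^\uD835[\uDC00-\uDC19]$/.test("\uD835\uDC00");
// The same range expressed directly in code points needs code point mode.
const codePointRange = /^[\u{1D400}-\u{1D419}]$/u.test("\uD835\uDC00");

// Change 2: match lengths are still counted in code units.
const unitModeSingle = single.match(/^.$/);             // no match
const pointModeSingle = single.match(/^.$/u)[0].length; // one code point
const unitModeThree = /.{3}/.exec(sample)[0].length;
const pointModeThree = /.{3}/u.exec(sample)[0].length;

// Change 3: lastIndex can advance by 2 after matching a single ".".
const unitRe = /./g;
unitRe.exec(sample);
const pointRe = /./gu;
pointRe.exec(sample);

console.log({
  legacyRange,                    // true
  codePointRange,                 // true
  unitModeSingle,                 // null
  pointModeSingle,                // 2
  unitModeThree,                  // 3
  pointModeThree,                 // 6
  unitLastIndex: unitRe.lastIndex,   // 1
  pointLastIndex: pointRe.lastIndex, // 2
});
```

With a string that mixes BMP and non-BMP characters, the /.{3}/ match length in code point mode lands anywhere from 3 to 6, depending on how many of the three matched code points need a surrogate pair.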

-- Steven Levithan


