On Mar 23, 2012, at 6:30 AM, Steven Levithan wrote:

> I've been wondering whether it might be best for the /u flag to do three 
> things at once, making it an all-around "support Unicode better" flag:

+all my internet points

Now you're talking!!

> 1. Switches from code unit to code point mode. /./gu matches any Unicode code 
> point, among other benefits outlined by Norbert.
> 
> 2. Makes \d\D\w\W\b\B match Unicode decimal digits and word characters. 
> [0-9], [A-Za-z0-9_], and lookaround provide fallbacks if you want to match 
> ASCII characters only while using /u.
> 
> 3. [New proposal] Makes /i use Unicode casefolding rules. 
> /ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.
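
For concreteness, the three points can be tried out as in the sketch below. It assumes an engine implementing the `u` flag as it was later standardized in ES2015, which postdates this thread, so it is only an approximation of the proposal. In particular, to my knowledge `\d` stayed ASCII-only under `u`, and the `\p{Nd}` property escape used here for comparison is a later (ES2018) addition. The variable names are purely illustrative.

```javascript
// Point 1: code point mode.
// U+1D306 is a single code point stored as two UTF-16 code units.
const tetragram = "\u{1D306}";
const unitsWithoutU = tetragram.match(/./g).length;  // each surrogate matches separately
const unitsWithU = tetragram.match(/./gu).length;    // the pair matches as one code point

// Point 2: digits beyond ASCII (ARABIC-INDIC DIGITS ONE, TWO, THREE).
const arabicIndic = "\u0661\u0662\u0663";
const asciiClass = /^[0-9]+$/u.test(arabicIndic);      // explicit ASCII fallback
const backslashD = /^\d+$/u.test(arabicIndic);         // assumption: \d stays ASCII-only in ES2015
const propertyEscape = /^\p{Nd}+$/u.test(arabicIndic); // ES2018 property escape, not part of this proposal

// Point 3: Unicode case folding with /i.
const sigma = /ΣΤΙΓΜΑΣ/iu.test("στιγμας");
// KELVIN SIGN (U+212A) is folded to "k" only when /u is present.
const kelvinWithU = /\u212A/iu.test("k");
const kelvinWithoutU = /\u212A/i.test("k");

console.log({ unitsWithoutU, unitsWithU, asciiClass, backslashD,
              propertyEscape, sigma, kelvinWithU, kelvinWithoutU });
```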

This is really exciting.

> As for whether the switch to code-point-based matching should be universal or 
> require /u (an issue that your proposal leaves open), IMHO it's better to 
> require /u since it avoids the need for transforming \uxxxx[\uyyyy-\uzzzz] to 
> [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}] and [\uwwww-\uxxxx][\uDC00-\uDFFF] to 
> [{\uwwww\uDC00}-{\uxxxx\uDFFF}], and additionally avoids at least three 
> potentially breaking changes (two of which are explicitly mentioned in your 
> proposal):

I haven't completely understood this part of the discussion. Looking at /u as a 
"little red switch" (LRS), i.e., an opportunity to make judicious breaks with 
compatibility, could we not allow character classes with unescaped non-BMP code 
points, e.g.:

    js> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u)
    ["𝌆𝌇𝌈𝌉𝌊"]

I'm still getting up to speed on Unicode and JS string semantics, so I'm 
guessing that I'm missing a reason why that wouldn't work... Presumably the JS 
source, as a sequence of UTF-16 code units, represents the tetragram code 
points as surrogate pairs. Can we not recognize surrogate pairs in character 
classes within a /u regexp and interpret them as code points?
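
A minimal sketch of that idea, assuming the example's range endpoints are U+1D306 and U+1D356 and assuming an engine with the `u` flag as later standardized in ES2015. `codePointFromSurrogates` is a hypothetical helper written for illustration, not an existing API. The hand-written pattern at the end is my reading of the `\uxxxx[\uyyyy-\uzzzz]` form Steven mentions.

```javascript
// Hypothetical helper: combine a high/low surrogate pair into one code point.
function codePointFromSurrogates(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

// Assumed range endpoints and the five tetragrams from the example.
const first = String.fromCodePoint(0x1D306);
const last = String.fromCodePoint(0x1D356);
const subject = String.fromCodePoint(0x1D306, 0x1D307, 0x1D308, 0x1D309, 0x1D30A);

// In the source, each endpoint is a surrogate pair (D834 DF06 and D834 DF56).
const decodedFirst = codePointFromSurrogates(first.charCodeAt(0), first.charCodeAt(1));
const decodedLast = codePointFromSurrogates(last.charCodeAt(0), last.charCodeAt(1));

const classSource = "[" + first + "-" + last + "]+";

// Without /u the class is read as code units: \uD834, the range \uDF06-\uD834,
// then \uDF56. That range is out of order, so construction fails.
let errorWithoutU = null;
try {
  new RegExp(classSource);
} catch (e) {
  errorWithoutU = e;
}

// With /u the pairs are interpreted as code points, giving one range.
const matchWithU = subject.match(new RegExp(classSource, "u"));

// Code-unit equivalent written by hand: fixed high surrogate, range of low surrogates.
const matchByHand = subject.match(/(?:\uD834[\uDF06-\uDF56])+/);

console.log(decodedFirst.toString(16), decodedLast.toString(16));
console.log(errorWithoutU && errorWithoutU.name, matchWithU[0] === subject, matchByHand[0] === subject);
```

The hand-written form only works here because both endpoints share the high surrogate D834; a range spanning several high surrogates would need the more involved rewriting described in the quoted text.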

Dave

