On Mar 23, 2012, at 6:30 AM, Steven Levithan wrote:
> I've been wondering whether it might be best for the /u flag to do three
> things at once, making it an all-around "support Unicode better" flag:
+all my internet points
Now you're talking!!
> 1. Switches from code unit to code point mode. /./gu matches any Unicode code
> point, among other benefits outlined by Norbert.
>
> 2. Makes \d\D\w\W\b\B match Unicode decimal digits and word characters.
> [0-9], [A-Za-z0-9_], and lookaround provide fallbacks if you want to match
> ASCII characters only while using /u.
>
> 3. [New proposal] Makes /i use Unicode casefolding rules.
> /ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.
This is really exciting.
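
For readers who want to poke at points 1 and 3, here is a minimal sketch. It assumes an engine that implements the u flag as it was later standardized in ES2015, which is not necessarily identical to the proposal quoted above. In particular, as far as I know, point 2 was not adopted in that form, so \d and \w stay ASCII-only under /u. The variable names are only illustrative.

```javascript
// Point 1: code point mode.
// U+1D306 is one code point but two UTF-16 code units (a surrogate pair).
const tetragram = "\u{1D306}";
const dotWithoutU = /^.$/.test(tetragram); // false: "." consumes a single code unit
const dotWithU = /^.$/u.test(tetragram);   // true: "." consumes the whole code point

// Point 3: Unicode case folding under /iu.
// U+212A KELVIN SIGN folds to "k" only when the u flag is present.
const kelvinWithoutU = /\u212A/i.test("k");  // false
const kelvinWithU = /\u212A/iu.test("k");    // true

// The example from the quoted proposal (capital sigma vs. final sigma).
const sigmaFold = /ΣΤΙΓΜΑΣ/iu.test("στιγμας");

console.log({ dotWithoutU, dotWithU, kelvinWithoutU, kelvinWithU, sigmaFold });
```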
> As for whether the switch to code-point-based matching should be universal or
> require /u (an issue that your proposal leaves open), IMHO it's better to
> require /u since it avoids the need for transforming \uxxxx[\uyyyy-\uzzzz] to
> [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}] and [\uwwww-\uxxxx][\uDC00-\uDFFF] to
> [{\uwwww\uDC00}-{\uxxxx\uDFFF}], and additionally avoids at least three
> potentially breaking changes (two of which are explicitly mentioned in your
> proposal):
I haven't completely understood this part of the discussion. Looking at /u as a
"little red switch" (LRS), i.e., an opportunity to make judicious breaks with
compatibility, could we not allow character classes with unescaped non-BMP code
points, e.g.:
js> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u)
["𝌆𝌇𝌈𝌉𝌊"]
I'm still getting up to speed on Unicode and JS string semantics, so I'm
guessing that I'm missing a reason why that wouldn't work... Presumably the JS
source, as a sequence of UTF-16 code units, represents the tetragram code
points as surrogate pairs. Can we not recognize surrogate pairs in character
classes within a /u regexp and interpret them as code points?
Dave
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss