On Mar 24, 2012, at 12:21, David Herman wrote:
[snip]
>> As for whether the switch to code-point-based matching should be universal
>> or require /u (an issue that your proposal leaves open), IMHO it's better to
>> require /u since it avoids the need for transforming \uxxxx[\uyyyy-\uzzzz]
>> to [{\uxxxx\uyyyy}-{\uxxxx\uzzzz}] and [\uwwww-\uxxxx][\uDC00-\uDFFF] to
>> [{\uwwww\uDC00}-{\uxxxx\uDFFF}], and additionally avoids at least three
>> potentially breaking changes (two of which are explicitly mentioned in your
>> proposal):
>
> I haven't completely understood this part of the discussion. Looking at /u as
> a "little red switch" (LRS), i.e., an opportunity to make judicious breaks
> with compatibility, could we not allow character classes with unescaped
> non-BMP code points, e.g.:
>
> js> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/u)
> ["𝌆𝌇𝌈𝌉𝌊"]
>
> I'm still getting up to speed on Unicode and JS string semantics, so I'm
> guessing that I'm missing a reason why that wouldn't work... Presumably the
> JS source of the regexp literal, as a sequence of UTF-16 code units,
> represents the tetragram code points as surrogate pairs. Can we not recognize
> surrogate pairs in character classes within a /u regexp and interpret them as
> code points?

With /u, that's exactly what happens. My first proposal was to make this happen
even without a new flag, i.e., make
"𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/)
work based on code points, and Steve is arguing against that because of
compatibility risk. My proposal also includes some transformations to keep
existing regular expressions working, and Steve correctly observes that if we
have a flag for code point mode, then the transformation is not needed - old
regular expressions would continue to work in code unit mode, while new regular
expressions with /u get code point treatment.
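
To make the difference between the two modes concrete, here is a small sketch. It assumes an engine that implements a `u` flag with the semantics later standardized in ES2015, which postdates this thread; the variable names are only illustrative. The hand-written pattern in the last step is one way to express the same range in code unit mode, in the spirit of the transformation discussed above, not the exact rewriting from the proposal.

```javascript
// The five tetragrams from the example, written with code point escapes
// (U+1D306 .. U+1D30A). Each one occupies two UTF-16 code units.
const text = "\u{1D306}\u{1D307}\u{1D308}\u{1D309}\u{1D30A}";

// Code point mode: the class is read as the single range U+1D306-U+1D356.
const codePointClass = new RegExp("[\u{1D306}-\u{1D356}]+", "u");
const codePointMatch = text.match(codePointClass);

// Code unit mode: the same source is read as the code units
// [ D834, DF06-D834, DF56 ], and the range DF06-D834 is out of order,
// so constructing the regexp throws. new RegExp is used instead of a
// literal so that the error can be caught at run time.
let codeUnitError = null;
try {
  new RegExp("[\u{1D306}-\u{1D356}]+");
} catch (e) {
  codeUnitError = e;
}

// A hand-written code unit equivalent of the same range: a fixed lead
// surrogate followed by a class of trail surrogates.
const legacyMatch = text.match(/(?:\uD834[\uDF06-\uDF56])+/);

// "." consumes one code unit without the flag, one code point with it.
const unitsPerDot = text.match(/./g).length;
const pointsPerDot = text.match(/./gu).length;

console.log(codePointMatch[0] === text);           // whole string matched
console.log(codeUnitError instanceof SyntaxError); // old mode rejects the class
console.log(legacyMatch[0] === text);              // manual rewrite still works
console.log(unitsPerDot, pointsPerDot);
```

Existing patterns written against code units, such as the hand-written one, keep their meaning without the flag, which is the compatibility argument for making code point matching opt-in.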
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss