Re: Questions regarding ES6 Unicode regular expressions

Norbert Lindenberg Mon, 25 Aug 2014 17:17:08 -0700

On Aug 25, 2014, at 1:59 , Mathias Bynens <[email protected]> wrote:

> Norbert’s original proposal for the `u` flag 
> (http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/#RegExp)
>  mentioned the following:
> 
>> Possibly the definition of the character classes `\d\D\w\W\b\B` is extended 
>> to their Unicode extensions, such as all characters in the Unicode category 
>> “Number, decimal” for `\d`, as proposed by Steven Levithan. Whether this can 
>> be done under the same flag or requires a different one still needs 
>> discussion.
> 
> Has this been discussed any further? (I couldn’t find any mention of it in 
> the meeting notes repository.) Should I file a bug?


The “needs discussion” part actually came from the March 2012 TC39 meeting:
https://mail.mozilla.org/pipermail/es-discuss/2012-March/021919.html
We subsequently had some discussions about how to go about such a discussion, 
which petered out because no regular expression expert was available to work 
with.

I suspect this issue needs a proposal rather than a bug.

> Norbert also suggested replacing ‘characters’ with ‘code points’ in sections 
> like 
> https://people.mozilla.org/~jorendorff/es6-draft.html#sec-characterclassescape
>  and 
> https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-charactersetmatcher-abstract-operation
>  when the `u` flag is set. It seems the intent was to make e.g. `/\d/u` match 
> `/[0-9]/`, and `/\D/u` match all Unicode code points except `[0-9]`. This is 
> different from `/\D/` which only matches BMP code points.
> 
> It seems like this change has not propagated to the spec draft, though. Is 
> this correct, and if so, what’s the reason for that?

Technically that works out as intended because section 21.2.2 defines 
“character” differently depending on whether the “u” flag is used or not:
https://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics
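
To make the effect of that definition concrete, here is a minimal sketch, assuming an engine that implements the `u` flag as drafted. The sample strings (U+1F4A9 and U+0663) and the variable names are my own choices for illustration, not anything taken from the spec:

```javascript
// U+1F4A9 is a supplementary code point: one code point, two UTF-16 code units.
const astral = "\u{1F4A9}";
// U+0663 ARABIC-INDIC DIGIT THREE is in the Unicode category "Number, decimal".
const arabicDigit = "\u0663";

const checks = {
  // Without `u`, a "character" is a code unit, so one \D cannot span the pair.
  nonDigitWithoutU: /^\D$/.test(astral),
  // Two \D in a row match the two surrogate code units separately.
  twoNonDigitsWithoutU: /^\D\D$/.test(astral),
  // With `u`, a "character" is a code point, so a single \D is enough.
  nonDigitWithU: /^\D$/u.test(astral),
  // \d itself is not extended by `u`; it still means [0-9].
  unicodeDigitWithU: /^\d$/u.test(arabicDigit),
};

console.log(checks);
```

The last check is the part that would change if `\d` were ever extended to the Unicode category; the first three only depend on how "character" is defined.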

It is somewhat confusing though, and it might be better to use a different spec 
mechanism. Ideas:
- Since we’re processing based on Lists anyway, we could just use “element”.
- We could map code points to UTF-32 code units (1:1), and then consistently 
  talk about code units, which would just have different sizes in the 
  different modes.
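
As a rough illustration of the “element” idea, the List that the matcher walks over could be modeled as below. This is only a sketch of my reading of the pattern semantics; `toInputList` is a hypothetical helper name, not something defined in the draft:

```javascript
// Hypothetical helper: model the input List the matcher processes.
// With fullUnicode (the `u` flag) each element is a code point;
// otherwise each element is a UTF-16 code unit.
function toInputList(str, fullUnicode) {
  if (fullUnicode) {
    // String iteration walks code points, so each `ch` holds one code point.
    return Array.from(str, (ch) => ch.codePointAt(0));
  }
  // Index-based access walks code units.
  return Array.from({ length: str.length }, (_, i) => str.charCodeAt(i));
}

const sample = "a\u{1F4A9}";
const codeUnitList = toInputList(sample, false);
const codePointList = toInputList(sample, true);

console.log(codeUnitList.map((n) => n.toString(16)));
console.log(codePointList.map((n) => n.toString(16)));
```

Either way the matcher just compares elements of a List; only the size of an element differs between the two modes.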

> The same goes for `/[^a]/u` – should this match all Unicode code points 
> except `a` or should it only match BMP code points?

As above – the definition of CharSet depends on the “u” flag:
https://people.mozilla.org/~jorendorff/es6-draft.html#sec-notation
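
A small sketch of what that means for `/[^a]/u`, again assuming an engine that implements the `u` flag as drafted (the sample string and variable names are mine):

```javascript
// U+1F4A9 again: one code point, two UTF-16 code units.
const symbol = "\u{1F4A9}";

// Without `u`, the CharSet for [^a] contains code units, so one class
// cannot cover the whole surrogate pair.
const wholeMatchWithoutU = /^[^a]$/.test(symbol);
// With `u`, the CharSet contains every code point except "a".
const wholeMatchWithU = /^[^a]$/u.test(symbol);
// "a" itself stays excluded in both modes.
const matchesA = /^[^a]$/u.test("a");

// The length of the first match shows what counted as one "character".
const lengthWithoutU = symbol.match(/[^a]/)[0].length; // lone high surrogate
const lengthWithU = symbol.match(/[^a]/u)[0].length;   // full surrogate pair

console.log({ wholeMatchWithoutU, wholeMatchWithU, matchesA, lengthWithoutU, lengthWithU });
```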

Norbert
