Re: Full Unicode based on UTF-16 proposal

Erik Corry Sat, 17 Mar 2012 10:20:44 -0700

2012/3/17 Steven L. <steves_l...@hotmail.com>:
> Erik Corry wrote:
>>
>> However I think we probably do want the /u modifier on regexps to
>> control the new backward-incompatible behaviour.  There may be some
>> way to relax this for regexp literals in opted in Harmony code, but
>> for new RegExp(...) and for other string literals I think there are
>> rather too many inconsistencies with the old behaviour.
>
>
> Disagree with adding /u for this purpose and disagree with breaking backward
> compatibility to let `/./.exec(s)[0].length == 2`.

Care to enlighten us with the thinking behind this disagreement?

> Instead, if this is
> deemed an important enough issue, there are two ways to match any Unicode
> grapheme that match existing regex library precedent:
>
> From Perl and PCRE:
>
> \X

This doesn't work inside [].  Were you envisioning the same restriction in JS?

Also, it matches a grapheme cluster, which may be useful but is
completely different to what the dot does.
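
To make the difference concrete, here is a small sketch. It assumes an engine that supports the /u flag and \p{...} property escapes, neither of which existed when this thread was written (they arrived later, in ES2015 and ES2018), so read it as an illustration of the semantics under discussion and not as a description of any implementation of the time. The variable names are made up for the example.

```javascript
// Gothic Ahsa (U+10330, surrogate pair D800 DF30) followed by
// a combining diaeresis (U+0308).
const s = "\uD800\uDF30\u0308";

// Legacy dot: matches one UTF-16 code unit, here the lead surrogate only.
const legacyDot = /./.exec(s)[0];

// Dot with /u: matches one whole code point, so the match has length 2.
const codePointDot = /./u.exec(s)[0];

// The \P{M}\p{M}* pattern: a base character plus its combining marks.
const cluster = /\P{M}\p{M}*/u.exec(s)[0];

console.log(legacyDot.length, codePointDot.length, cluster.length); // 1 2 3
```

The three lengths differ because the three patterns answer three different questions: one code unit, one code point, one base character with its marks.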

> From Perl, PCRE, .NET, Java, XML Schema, and ICU (among others):
>
> \P{M}\p{M}*
>
> Obviously \X is prettier, but because it's fairly rare for people to care
> about this, IMO the more widely compatible solution that uses Unicode
> categories is Good Enough if Unicode category syntax is on the table for
> ES6.
>
> Norbert Lindenberg wrote:
>>
>> \uxxxx[\uyyyy-\uzzzz] is interpreted as [\uxxxx\uyyyy-\uxxxx\uzzzz]

Norbert, this just happens automatically if unmatched surrogates are
just treated as if they were normal code units.

>> [\uwwww-\uxxxx][\uyyyy-\uzzzz] is interpreted as
>> [\uwwww\uyyyy-\uxxxx\uzzzz]

Norbert, this will have different semantics to the current
implementations unless the second range is the full trail surrogate
range.

I agree with Steven that these two cases should just be left alone,
which means they will continue to work the way they have until now.
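
A sketch of both cases, for illustration only. The concrete ranges are picked for the example and do not come from Norbert's proposal, and the code point range is written with the /u flag and \u{...} escapes, which assumes an ES2015-or-later engine.

```javascript
// Case 1: a lead surrogate followed by a class of trail surrogates.
// Without /u every \uXXXX is just a code unit, so this already
// matches the surrogate pair for U+10330.
const case1 = /^\uD800[\uDF30-\uDF3F]$/;
const ahsa = "\uD800\uDF30"; // U+10330
console.log(case1.test(ahsa)); // true

// Case 2: two classes in a row, read as code units (current behaviour).
const asCodeUnits = /^[\uD800-\uD801][\uDC00-\uDC10]$/;
// The same escapes read as one code point range, U+10000..U+10410.
const asCodePoints = /^[\u{10000}-\u{10410}]$/u;

const probe = "\uD800\uDFFF"; // U+103FF
console.log(asCodeUnits.test(probe));  // false: DFFF is outside [DC00-DC10]
console.log(asCodePoints.test(probe)); // true: U+103FF is inside the range
```

The two readings of case 2 only coincide when the second class covers every trail surrogate, \uDC00-\uDFFF.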

> Some people will want a way to match arbitrary Unicode code
> points rather than graphemes anyway, so leaving \uhhhh alone lets that use
> case continue working. This would still allow modifying the handling of
> literal astral/supplementary characters in RegExps. If it can be handled
> sensibly, I'm all for treating literal characters in RegExps as discrete
> graphemes rather than splitting them into surrogate pairs.

You seem to be confusing graphemes and Unicode code points.  Here is
the same text 3 times:

Four UTF-16 code units:

0x0020 0xD800 0xDF30 0x0308

Three Unicode code points:

0x20 0x10330 0x308

Two graphemes:

" " "𐌰̈"  <-- This is an attempt to show a Gothic Ahsa with an umlaut.
My mail program probably screwed it up.

The proposal you are responding to is all about adding Unicode code
point handling to regexps.  It is not about adding grapheme support,
which is a rather different issue.
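
The three counts above can be reproduced with a short sketch. It assumes a modern engine (ES2018 or later) for the /u flag and \p{M}, and it uses \P{M}\p{M}* as a rough stand-in for grapheme clusters; that pattern is only an approximation of \X and of the Unicode segmentation rules.

```javascript
// The example text: space, Gothic Ahsa (U+10330), combining diaeresis.
const text = "\u0020\uD800\uDF30\u0308";

// UTF-16 code units: what String.prototype.length counts.
const codeUnits = text.length;

// Code points: string iteration keeps surrogate pairs together.
const codePoints = Array.from(text).length;

// Approximate graphemes: a non-mark followed by any number of marks.
const graphemes = text.match(/\P{M}\p{M}*/gu).length;

console.log(codeUnits, codePoints, graphemes); // 4 3 2
```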

-- 
Erik Corry
_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
