On Mar 26, 2012, at 9:45 , Steven Levithan wrote:
> Sorry for jumping between messages...
>
> Roger Andrews wrote:
>> Doesn't C/C++ allow non-BMP code points using \Uxxxxxxxx in character
>> literals. The \U format expresses a full 32-bit code, which could be mapped
>> internally to two 16-bit UTF-16 codes.
>
> Python uses the same syntax in regular expressions. But, as Norbert noted,
> there is already a strawman for \u{X..}. If it were adopted, I think it is
> clear that it should also be extended to RegExp literals. Of course, this
> adds some complication when referencing numbers above FFFF unless /u is made
> the default everywhere, since it implies code-point-based matching. E.g.,
> what does /[^\0-\uFFFF\u{10000}]/ without /u match?
As long as the underlying system is UTF-16 based, I'd think \u{10000} is simply
a different notation for \uD800\uDC00. But with code unit based matching that
will not result in the intended behavior.
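To make the problem with the example concrete, here is a minimal sketch. It assumes an engine that implements \u{...} escapes and the /u flag as they were later standardized in ES2015, which may differ in detail from the proposals discussed here; toSurrogatePair is a hypothetical helper name, not part of any proposal.

```javascript
// Hypothetical helper: split a supplementary code point (above 0xFFFF)
// into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

const [high, low] = toSurrogatePair(0x10000);
// \u{10000} and \uD800\uDC00 denote the same two code units.
const sameString = String.fromCharCode(high, low) === "\u{10000}";

// Code unit mode: the range \0-\uFFFF already covers every code unit,
// so the negated class can never match, whatever else is listed in it.
const codeUnitClass = /[^\0-\uFFFF\uD800\uDC00]/;
const unitMatchesOther = codeUnitClass.test("\u{10001}");

// Code point mode (assumed ES2015 /u semantics): the class excludes the BMP
// and U+10000, so it matches any other supplementary code point.
const codePointClass = /[^\0-\uFFFF\u{10000}]/u;
const pointMatchesOther = codePointClass.test("\u{10001}");
const pointMatchesExcluded = codePointClass.test("\u{10000}");

console.log({ sameString, unitMatchesOther, pointMatchesOther, pointMatchesExcluded });
```

Under those assumptions the code unit version matches nothing at all, while the code point version matches every supplementary character except U+10000.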
> The example above also hints at additional potentially breaking changes for
> code point matching by default that haven't yet been discussed in this
> thread: that the meaning of negated character classes and shorthands would
> change, and that their match length may be 2 (like the dot).
Yes.
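A small sketch of the match-length change, assuming /u behaves as later standardized in ES2015 (the proposal here may differ in detail). The sample character U+1D306 is only an illustration.

```javascript
// One supplementary code point, stored as two UTF-16 code units.
const astral = "\u{1D306}";

// Code unit mode: the dot, a negated class, and a negated shorthand
// each consume a single code unit, i.e. half of the surrogate pair.
const unitLengths = [/./, /[^a]/, /\W/].map((re) => astral.match(re)[0].length);

// Code point mode (assumed ES2015 /u semantics): each consumes the whole pair.
const pointLengths = [/./u, /[^a]/u, /\W/u].map((re) => astral.match(re)[0].length);

console.log(unitLengths, pointLengths);
```

The first array holds match lengths of 1, the second match lengths of 2, which is the kind of change existing code could observe.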
> Roger Andrews wrote:
>> I'm struggling with how non-BMP escapes would be used in practice in strings
>> & regexps -- especially regexp ranges. Will Strawman style be used in
>> string & regexp literals?
>> [...examples snipped]
>
> I'm not sure whether this was already clear, but the curly braces I included
> in my paraphrasing of Norbert's proposed transformations were not meant to be
> included literally. I was trying to describe ranges between arbitrary code
> points, represented by pairs of high and low surrogates. As far as I
> understand, no existing proposal would allow a character class range written
> as /[{\uD834\uDF07}-{\uD834\uDF56}]/ to work correctly, with or without the
> curly braces. To match a range outside the BMP in a literal RegExp, you would
> have to use [<char>-<char>] (where <char> represents a literal character, and
> this may require flag /u to work) or [\u{X..}-\u{X..}] (assuming support for
> this syntax is added, and where X.. represents a hex number between 0 and at
> least 10FFFF).
In my proposal the following two regular expressions are equivalent:
/[𝌆-𝍖]+/u
/[\uD834\uDF06-\uD834\uDF56]+/u
They are made equivalent by the first preprocessing step proposed for 15.10.4.1
and the subsequent interpretation of UTF-16 sequences as code points.
I think I'd process Unicode code point escapes by first converting them to
equivalent code unit escapes and then following the same path. This would make
/[\u{1D306}-\u{1D356}]+/u
equivalent to the two above.
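As a rough check of that equivalence, the sketch below builds all three forms and runs them over the same input. It assumes /u and \u{...} as later standardized in ES2015, not the exact preprocessing steps proposed for 15.10.4.1; the variable names are illustrative only.

```javascript
const input = "\u{1D306}\u{1D307}\u{1D356}";

// Same source text as a regex literal containing the two raw characters;
// built from code points here so the example survives encoding problems.
const first = String.fromCodePoint(0x1D306);
const last = String.fromCodePoint(0x1D356);
const withLiterals = new RegExp("[" + first + "-" + last + "]+", "u");
const withCodeUnitEscapes = /[\uD834\uDF06-\uD834\uDF56]+/u;
const withCodePointEscapes = /[\u{1D306}-\u{1D356}]+/u;

const matches = [withLiterals, withCodeUnitEscapes, withCodePointEscapes]
  .map((re) => re.exec(input)[0]);
const allMatchWholeInput = matches.every((m) => m === input);

// Without /u the same source is read as code units: \uDF06-\uD834 is then
// a descending range, so constructing the regex fails.
let codeUnitModeError = null;
try {
  new RegExp("[\\uD834\\uDF06-\\uD834\\uDF56]+");
} catch (e) {
  codeUnitModeError = e;
}

console.log(allMatchWholeInput, codeUnitModeError instanceof SyntaxError);
```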
> Norbert Lindenberg wrote:
>> [...snip] My first proposal was to make this happen even without a new
>> flag, i.e., make
>> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/)
>> work based on code points, and Steve is arguing against that because of
>> compatibility risk. My proposal also includes some transformations to keep
>> existing regular expressions working, and Steve correctly observes that if
>> we have a flag for code point mode, then the transformation is not needed -
>> old regular expressions would continue to work in code unit mode, while new
>> regular expressions with /u get code point treatment.
>
> Although I've argued the compatibility risk angle, on that point I should
> defer to implementers and others who might have a better sense of the scope
> of risk/damage to existing programs. More personally affecting, though, is
> the negative gut reaction I have to the well-thought-out but ugly and
> complicated (not so much in implementation, but for devs who have to learn
> about it) transformations that would otherwise be necessary to avoid breaking
> current regexes. And like David, I think just requiring /u is not so bad,
> especially since I'd want to use it for its other meanings anyway.
>
> I'm also nervous about using different default semantics inside and out of
> ES6 modules, but David Herman has already well articulated my concerns and
> you've already responded, so I'll leave that discussion to you two except to
> say that if /u is added as a general "fix Unicode" flag, IMO it's reasonable
> to automatically turn it on in ES6 modules. That's because I think applying
> only code unit to code point mode switching in modules by default is too
> magical and confusing, but if it were described as turning on /u by default,
> that's easy to understand and explain.
Good input.
> Lasse Reichstein wrote:
>> Steven Levithan wrote:
>>> I've been wondering whether it might be best for the /u flag to do three
>>> things at once, making it an all-around "support Unicode better" flag:
>>> [...]
>>> 3. [New proposal] Makes /i use Unicode casefolding rules.
>>
>> Yey, I'm for it :)
>> Especially if it means dropping the rather naïve canonicalize function
>> that can't canonicalize an ASCII character with a non-ASCII character.
>
> That would be my hope as well.
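A short sketch of the limitation Lasse mentions, assuming /iu uses Unicode simple case folding as later standardized in ES2015 (the third meaning proposed above may not match it exactly). The two sample characters are my choice of illustration, not taken from the thread.

```javascript
const kelvinSign = "\u212A"; // KELVIN SIGN, case-folds to ASCII "k"
const longS = "\u017F";      // LATIN SMALL LETTER LONG S, case-folds to ASCII "s"

// Existing canonicalization refuses to map a non-ASCII character
// onto an ASCII one, so neither pair is treated as equal under /i.
const legacy = [/k/i.test(kelvinSign), /s/i.test(longS)];

// Assumed ES2015 /iu semantics: Unicode case folding makes both pairs equal.
const unicode = [/k/iu.test(kelvinSign), /s/iu.test(longS)];

console.log(legacy, unicode);
```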
>
> Norbert Lindenberg wrote:
>> One concern: I think code point based matching should be the default for
>> regex literals within modules (where we know the code is written for
>> Harmony). Does it make sense to also interpret \d\D\w\W\b\B as full
>> Unicode sets for such literals?
>
> As I said above, yes, I think it makes sense to apply all semantics of /u by
> default within modules. Previously in this thread, I detailed what \d\w\b\s
> mean in various regex flavors. The ones that give Unicode meanings by default
> are .NET and Perl, so ES would be in excellent regex company. Additionally,
> Java's \b (only) supports Unicode by default, as does ES's \s.
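For reference, a minimal sketch of the current ES behavior described here, where only \s is Unicode-aware. The sample characters (no-break space, an Arabic-Indic digit, and an accented letter) are illustrative choices.

```javascript
const noBreakSpace = "\u00A0";     // NO-BREAK SPACE
const arabicIndicThree = "\u0663"; // ARABIC-INDIC DIGIT THREE
const eAcute = "\u00E9";           // LATIN SMALL LETTER E WITH ACUTE

const results = {
  // \s already covers Unicode white space.
  whitespace: /\s/.test(noBreakSpace),
  // \d is limited to the ASCII digits 0-9.
  digit: /\d/.test(arabicIndicThree),
  // \w is limited to [A-Za-z0-9_].
  wordChar: /\w/.test(eAcute),
  // \b is defined via \w, so no boundary is seen between the accented
  // letter and the end of the string.
  boundary: /\bcaf\u00E9\b/.test("caf" + eAcute),
};

console.log(results);
```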
>
> Norbert Lindenberg wrote:
>> In the other direction it's clear that using /u for \d\D\w\W\b\B has to
>> imply code point mode.
>
> Not if their meaning was limited to the BMP, which is already true for
> \D\W\B\S. /\D\D/.test(singleAstralNondigit) == true. With code point matching
> it would be false. Yet another reason to tie the multiple proposed meanings
> of /u together.
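The \D\D observation can be reproduced with a short sketch, assuming /u gives code point matching as later standardized in ES2015. U+1D306 stands in for singleAstralNondigit here.

```javascript
// A single non-digit outside the BMP: one code point, two code units.
const singleAstralNondigit = "\u{1D306}";

const codeUnitsSeen = singleAstralNondigit.length;               // counts code units
const codePointsSeen = Array.from(singleAstralNondigit).length;  // counts code points

// Code unit mode: each \D consumes one surrogate, so two of them fit.
const codeUnitMode = /\D\D/.test(singleAstralNondigit);

// Code point mode (assumed ES2015 /u semantics): the first \D consumes the
// whole pair, leaving nothing for the second.
const codePointMode = /\D\D/u.test(singleAstralNondigit);

console.log({ codeUnitsSeen, codePointsSeen, codeUnitMode, codePointMode });
```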
>
> Norbert Lindenberg wrote:
>> Steven Levithan wrote:
>>> 3. [New proposal] Makes /i use Unicode casefolding rules.
>>> /ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.
>>
>> We probably should review the complete Unicode Technical Standard #18,
>> Unicode Regular Expressions, and see how we can upgrade RegExp for better
>> Unicode support. Maybe on a separate thread...
>
> Agreed. You may already be thinking this, but IMO if we're going to add /u as
> a Little Red Switch (as David called it), the priority should be on making
> sure that /u gets all aspects of Unicode-aware regular expression semantics
> done right, before looking at new features from UTS#18 like Unicode property
> matching.
>
> -- Steven Levithan
_______________________________________________
es-discuss mailing list
https://mail.mozilla.org/listinfo/es-discuss