Sorry for jumping between messages...

Roger Andrews wrote:
Doesn't C/C++ allow non-BMP code points using \Uxxxxxxxx in character literals? The \U format expresses a full 32-bit code, which could be mapped internally to two 16-bit UTF-16 codes.

Python uses the same syntax in regular expressions. But, as Norbert noted, there is already a strawman for \u{X..}. If it were adopted, I think it is clear that it should also be extended to RegExp literals. Of course, this adds some complication when referencing numbers above FFFF unless /u is made the default everywhere, since it implies code-point-based matching. E.g., what does /[^\0-\uFFFF\u{10000}]/ without /u match?

The example above also hints at additional potentially breaking changes for code point matching by default that haven't yet been discussed in this thread: that the meaning of negated character classes and shorthands would change, and that their match length may be 2 (like the dot).
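
To make the match-length point concrete, here is a minimal sketch of how matching without any new flag treats a single non-BMP character. It uses only existing syntax (no /u and no \u{X..} escapes); the variable names are illustrative only.

```javascript
// "\uD834\uDF06" is U+1D306 (𝌆) written as its two UTF-16 code units.
const astral = "\uD834\uDF06";

const codeUnitResults = {
  length: astral.length,                    // counts code units, not code points
  dotOnce: /^.$/.test(astral),              // one dot cannot span both halves
  dotTwice: /^..$/.test(astral),            // two dots consume the pair
  negatedOnce: /^[^a]$/.test(astral),       // same story for a negated class
  negatedTwice: /^[^a][^a]$/.test(astral),
  // Every code unit lies in \0-\uFFFF, so this negated class can never match.
  aboveBmp: /[^\0-\uFFFF]/.test(astral),
};

console.log(codeUnitResults);
```

Under code point matching, the dot and the negated class would each be expected to consume the whole character in one step, which is the change in match length described above.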

Roger Andrews wrote:
I'm struggling with how non-BMP escapes would be used in practice in strings & regexps -- especially regexp ranges. Will Strawman style be used in string & regexp literals?
[...examples snipped]

I'm not sure whether this was already clear, but the curly braces I included in my paraphrasing of Norbert's proposed transformations were not meant to be included literally. I was trying to describe ranges between arbitrary code points, represented by pairs of high and low surrogates. As far as I understand, no existing proposal would allow a character class range written as /[{\uD834\uDF07}-{\uD834\uDF56}]/ to work correctly, with or without the curly braces. To match a range outside the BMP in a literal RegExp, you would have to use [<char>-<char>] (where <char> represents a literal character, and this may require flag /u to work) or [\u{X..}-\u{X..}] (assuming support for this syntax is added, and where X.. represents a hex number between 0 and at least 10FFFF).
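
As a rough sketch of why the surrogate-pair range cannot work as written: the helper below (toSurrogatePair, a hypothetical name, not part of any proposal) splits a code point into its high and low surrogates, and the try/catch shows what a code-unit-based engine makes of the class. The class body is passed as a string so that the script itself still parses.

```javascript
// Hypothetical helper: split a code point above U+FFFF into UTF-16 surrogates.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

function hex(n) {
  return n.toString(16).toUpperCase();
}

const rangeStart = toSurrogatePair(0x1D307).map(hex); // ["D834", "DF07"]
const rangeEnd = toSurrogatePair(0x1D356).map(hex);   // ["D834", "DF56"]

// Read one code unit at a time, the class body is: \uD834, then the range
// \uDF07-\uD834, then \uDF56. The middle range runs backwards (DF07 > D834).
let naiveRangeError = null;
try {
  new RegExp("[\\uD834\\uDF07-\\uD834\\uDF56]");
} catch (e) {
  naiveRangeError = e;
}

console.log(rangeStart, rangeEnd, naiveRangeError && naiveRangeError.name);
```

So the range is not merely matched incorrectly; without code point semantics it is rejected outright as an out-of-order range.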

Norbert Lindenberg wrote:
[...snip] My first proposal was to make this happen even without a new
flag, i.e., make
"𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/)
work based on code points, and Steve is arguing against that because of
compatibility risk. My proposal also includes some transformations to keep
existing regular expressions working, and Steve correctly observes that if
we have a flag for code point mode, then the transformation is not needed -
old regular expressions would continue to work in code unit mode, while new
regular expressions with /u get code point treatment.

Although I've argued the compatibility risk angle, on that point I should defer to implementers and others who might have a better sense of the scope of risk/damage to existing programs. What affects me more personally, though, is my negative gut reaction to the well-thought-out but ugly and complicated transformations (complicated not so much to implement as for devs who have to learn about them) that would otherwise be necessary to avoid breaking current regexes. And like David, I think just requiring /u is not so bad, especially since I'd want to use it for its other meanings anyway.
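
For a feel of what expressing a non-BMP range in code unit terms involves, here is a hand-written rewrite of the quoted /[𝌆-𝍖]+/ example. This is only an illustration under the assumption that both endpoints share a high surrogate; it is not the transformation Norbert specified, and the names are made up for the example.

```javascript
// 𝌆𝌇𝌈𝌉𝌊 (U+1D306 to U+1D30A), spelled out as UTF-16 code units.
const tetragrams =
  "\uD834\uDF06\uD834\uDF07\uD834\uDF08\uD834\uDF09\uD834\uDF0A";

// Code unit equivalent of /[𝌆-𝍖]+/ (U+1D306 to U+1D356).
// Both endpoints share the high surrogate D834, so only the low surrogate
// needs a range, and the pair must be grouped before applying "+".
const rewritten = /(?:\uD834[\uDF06-\uDF56])+/;

const rewrittenMatch = tetragrams.match(rewritten);
console.log(rewrittenMatch !== null && rewrittenMatch[0] === tetragrams);
```

A range whose endpoints have different high surrogates needs several such alternatives, which is where the complexity for people reading and writing these patterns comes from.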

I'm also nervous about using different default semantics inside and out of ES6 modules, but David Herman has already articulated my concerns well and you've already responded, so I'll leave that discussion to you two except to say that if /u is added as a general "fix Unicode" flag, IMO it's reasonable to automatically turn it on in ES6 modules. That's because I think applying only the switch from code unit to code point mode in modules by default is too magical and confusing, but if it were described as turning on /u by default, that's easy to understand and explain.

Lasse Reichstein wrote:
Steven Levithan wrote:
I've been wondering whether it might be best for the /u flag to do three
things at once, making it an all-around "support Unicode better" flag:
[...]
3. [New proposal] Makes /i use Unicode casefolding rules.

Yey, I'm for it :)
Especially if it means dropping the rather naïve canonicalize function
that can't canonicalize an ASCII character with a non-ASCII character.

That would be my hope as well.
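
A small sketch of the limitation Lasse mentions, as it shows up with plain /i and no new flag. The two characters chosen here (KELVIN SIGN and LATIN SMALL LETTER LONG S) are just convenient examples of non-ASCII characters whose case partners are ASCII.

```javascript
const kelvinSign = "\u212A"; // K, KELVIN SIGN (lowercases to ASCII "k")
const longS = "\u017F";      // ſ, LATIN SMALL LETTER LONG S (uppercases to ASCII "S")

const caseInsensitive = {
  // The existing canonicalization never maps a non-ASCII character onto an
  // ASCII one, so neither of these matches.
  kelvinVsK: /k/i.test(kelvinSign),
  longSVsS: /s/i.test(longS),
  // Pairs that stay outside ASCII, such as é/É, do match.
  accented: /\u00E9/i.test("\u00C9"),
};

console.log(caseInsensitive);
```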

Norbert Lindenberg wrote:
One concern: I think code point based matching should be the default for
regex literals within modules (where we know the code is written for
Harmony). Does it make sense to also interpret \d\D\w\W\b\B as full
Unicode sets for such literals?

As I said above, yes, I think it makes sense to apply all semantics of /u by default within modules. Previously in this thread, I detailed what \d\w\b\s mean in various regex flavors. The ones that give Unicode meanings by default are .NET and Perl, so ES would be in excellent regex company. Additionally, Java's \b (only) supports Unicode by default, as does ES's \s.
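
For reference, a quick sketch of how the ES shorthands behave today, without any new flag. The sample characters are arbitrary picks for illustration.

```javascript
const shorthandResults = {
  // \s already covers Unicode whitespace.
  nbspIsSpace: /\s/.test("\u00A0"),         // NO-BREAK SPACE
  emSpaceIsSpace: /\s/.test("\u2003"),      // EM SPACE
  // \d and \w are ASCII-only.
  arabicDigit: /\d/.test("\u0663"),         // ARABIC-INDIC DIGIT THREE
  accentedWord: /^\w+$/.test("caf\u00E9"),  // café
  // \b is defined in terms of \w, so it reports a boundary between "f" and "é".
  boundaryInsideWord: /f\b/.test("caf\u00E9"),
};

console.log(shorthandResults);
```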

Norbert Lindenberg wrote:
In the other direction it's clear that using /u for \d\D\w\W\b\B has to
imply code point mode.

Not if their meaning were limited to the BMP, which is already true for \D\W\B\S. Currently, /\D\D/.test(singleAstralNondigit) == true, because each surrogate half is matched as a separate non-digit. With code point matching it would be false. Yet another reason to tie the multiple proposed meanings of /u together.
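
Spelled out as a runnable sketch, with U+1D306 standing in as an arbitrary astral non-digit:

```javascript
// One code point, two code units.
const singleAstralNondigit = "\uD834\uDF06"; // U+1D306

const pairResults = {
  // Each negated shorthand consumes one surrogate half, so two are needed.
  twoNonDigits: /^\D\D$/.test(singleAstralNondigit),
  twoNonWords: /^\W\W$/.test(singleAstralNondigit),
  twoNonSpaces: /^\S\S$/.test(singleAstralNondigit),
  // A single \D cannot cover the whole character in code unit mode.
  oneNonDigit: /^\D$/.test(singleAstralNondigit),
};

console.log(pairResults);
```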

Norbert Lindenberg wrote:
Steven Levithan wrote:
3. [New proposal] Makes /i use Unicode casefolding rules.
/ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.

We probably should review the complete Unicode Technical Standard #18,
Unicode Regular Expressions, and see how we can upgrade RegExp for better
Unicode support. Maybe on a separate thread...

Agreed. You may already be thinking this, but IMO if we're going to add /u as a Little Red Switch (as David called it), the priority should be on making sure that /u gets all aspects of Unicode-aware regular expression semantics done right, before looking at new features from UTS#18 like Unicode property matching.

-- Steven Levithan
_______________________________________________
es-discuss mailing list
https://mail.mozilla.org/listinfo/es-discuss