Sorry for jumping between messages...
Roger Andrews wrote:
Doesn't C/C++ allow non-BMP code points using \Uxxxxxxxx in character
literals? The \U format expresses a full 32-bit code, which could be
mapped internally to two 16-bit UTF-16 codes.
Python uses the same syntax in regular expressions. But, as Norbert noted,
there is already a strawman for \u{X..}. If it were adopted, I think it is
clear that it should also be extended to RegExp literals. Of course, this
adds some complication when referencing code points above FFFF unless /u is
made the default everywhere, since such a reference implies code-point-based
matching. E.g., what does /[^\0-\uFFFF\u{10000}]/ without /u match?
The example above also hints at additional potentially breaking changes from
making code point matching the default that haven't yet been discussed in
this thread: the meaning of negated character classes and shorthands would
change, and their match length could be 2 code units (as with the dot).
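
To make the question concrete, here is a rough sketch of how this plays out,
assuming an engine that implements /u and \u{X..} with the code point
semantics discussed in this thread (as the ES2015 design ended up doing). The
variable names are mine and purely illustrative.

```javascript
// Sketch only: assumes an engine with the /u flag and \u{X..} escapes.
const astral = "\u{1D306}"; // one code point, two UTF-16 code units

// Without /u, \u{10000} is not a code point escape: "\u" is read as a plain
// "u", and the braces and digits become literal class members.
const withoutU = /[^\0-\uFFFF\u{10000}]/;
const withU = /[^\0-\uFFFF\u{10000}]/u;

// Code unit mode: every code unit already falls inside \0-\uFFFF.
const matchesWithoutU = withoutU.test(astral);
// Code point mode: U+1D306 is above FFFF and is not U+10000.
const matchesWithU = withU.test(astral);
const matchesU10000 = withU.test("\u{10000}");

// Match length of the dot and of a negated class, in code units.
const dotLengthWithoutU = astral.match(/./)[0].length;
const dotLengthWithU = astral.match(/./u)[0].length;
const negatedLengthWithU = astral.match(/[^a]/u)[0].length;

console.log(matchesWithoutU, matchesWithU, matchesU10000); // false true false
console.log(dotLengthWithoutU, dotLengthWithU, negatedLengthWithU); // 1 2 2
```

So without /u the class above can never match anything, and with /u both the
dot and negated classes can consume two code units at once.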
Roger Andrews wrote:
I'm struggling with how non-BMP escapes would be used in practice in
strings & regexps -- especially regexp ranges. Will Strawman style be used
in string & regexp literals?
[...examples snipped]
I'm not sure whether this was already clear, but the curly braces I included
in my paraphrasing of Norbert's proposed transformations were not meant to
be included literally. I was trying to describe ranges between arbitrary
code points, represented by pairs of high and low surrogates. As far as I
understand, no existing proposal would allow a character class range written
as /[{\uD834\uDF07}-{\uD834\uDF56}]/ to work correctly, with or without the
curly braces. To match a range outside the BMP in a literal RegExp, you
would have to use [<char>-<char>] (where <char> represents a literal
character, and this may require flag /u to work) or [\u{X..}-\u{X..}]
(assuming support for this syntax is added, and where X.. represents a hex
number between 0 and at least 10FFFF).
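
As a minimal sketch of the two working spellings, assuming an engine where /u
and \u{X..} are available with code point semantics (lo, hi, and the other
names are just for illustration):

```javascript
// Sketch only: assumes an engine with the /u flag and \u{X..} escapes.
const sample = "\u{1D306}\u{1D307}\u{1D308}"; // three astral characters

// Spelling 1: code point escapes as the range endpoints.
const byEscape = /^[\u{1D306}-\u{1D356}]+$/u;

// Spelling 2: literal characters as the range endpoints, built through the
// RegExp constructor so this file stays ASCII-only.
const lo = String.fromCodePoint(0x1D306);
const hi = String.fromCodePoint(0x1D356);
const byLiteral = new RegExp("^[" + lo + "-" + hi + "]+$", "u");

// Without /u the same literal range is read as code units:
// [\uD834 \uDF06-\uD834 \uDF56], whose middle range is out of order.
let codeUnitModeError = null;
try {
  new RegExp("[" + lo + "-" + hi + "]");
} catch (e) {
  codeUnitModeError = e;
}

console.log(byEscape.test(sample), byLiteral.test(sample)); // true true
console.log(codeUnitModeError instanceof SyntaxError);      // true
```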
Norbert Lindenberg wrote:
[...snip] My first proposal was to make this happen even without a new
flag, i.e., make
"𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/)
work based on code points, and Steve is arguing against that because of
compatibility risk. My proposal also includes some transformations to keep
existing regular expressions working, and Steve correctly observes that if
we have a flag for code point mode, then the transformation is not
needed - old regular expressions would continue to work in code unit mode,
while new regular expressions with /u get code point treatment.
Although I've argued the compatibility risk angle, on that point I should
defer to implementers and others who might have a better sense of the scope
of risk/damage to existing programs. More personally affecting, though, is
the negative gut reaction I have to the well-thought-out but ugly and
complicated (not so much in implementation, but for devs who have to learn
about it) transformations that would otherwise be necessary to avoid
breaking current regexes. And like David, I think just requiring /u is not
so bad, especially since I'd want to use it for its other meanings anyway.
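
To illustrate the kind of existing regex that is at stake, here is a small
sketch, again assuming an engine where /u selects code point matching. The
pattern is a typical hand-written "match one surrogate pair" regex, not one
taken from any particular program.

```javascript
// Sketch only: assumes an engine where /u selects code point matching.
const ch = "\u{1D306}"; // a single astral character

// A common legacy idiom: high surrogate followed by low surrogate.
const legacyAstral = /^[\uD800-\uDBFF][\uDC00-\uDFFF]$/;
// The same pattern source, unchanged, compiled in code point mode.
const sameSourceWithU = new RegExp(legacyAstral.source, "u");

const legacyResult = legacyAstral.test(ch);       // sees two code units
const codePointResult = sameSourceWithU.test(ch); // sees one code point

console.log(legacyResult, codePointResult); // true false
```

With an opt-in flag, the first regex keeps working as written; only regexes
that ask for /u get the second behavior.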
I'm also nervous about using different default semantics inside and out of
ES6 modules, but David Herman has already well articulated my concerns and
you've already responded, so I'll leave that discussion to you two except to
say that if /u is added as a general "fix Unicode" flag, IMO it's reasonable
to automatically turn it on in ES6 modules. That's because I think having
modules switch only from code unit to code point mode by default is too
magical and confusing, but if it were described as turning on /u by default,
that would be easy to understand and explain.
Lasse Reichstein wrote:
Steven Levithan wrote:
I've been wondering whether it might be best for the /u flag to do three
things at once, making it an all-around "support Unicode better" flag:
[...]
3. [New proposal] Makes /i use Unicode casefolding rules.
Yey, I'm for it :)
Especially if it means dropping the rather naïve canonicalize function
that can't canonicalize an ASCII character with a non-ASCII character.
That would be my hope as well.
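
For reference, a quick sketch of the limitation Lasse mentions and of what
Unicode case folding would change, assuming an engine where /iu uses simple
case folding (the object and variable names are only illustrative):

```javascript
// Sketch only: assumes an engine where /iu applies Unicode simple case folding.
const longS = "\u017F";  // LATIN SMALL LETTER LONG S
const kelvin = "\u212A"; // KELVIN SIGN

// The existing canonicalization never maps a non-ASCII character onto an
// ASCII one, so these pairs stay distinct under plain /i.
const legacy = { longS: /s/i.test(longS), kelvin: /k/i.test(kelvin) };

// With case folding, both characters fold to their ASCII counterparts.
const folded = { longS: /s/iu.test(longS), kelvin: /k/iu.test(kelvin) };

// Capital sigma and final sigma share the fold target U+03C3.
const sigmaFolded = /\u03A3/iu.test("\u03C2");

console.log(legacy);      // { longS: false, kelvin: false }
console.log(folded);      // { longS: true, kelvin: true }
console.log(sigmaFolded); // true
```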
Norbert Lindenberg wrote:
One concern: I think code point based matching should be the default for
regex literals within modules (where we know the code is written for
Harmony). Does it make sense to also interpret \d\D\w\W\b\B as full
Unicode sets for such literals?
As I said above, yes, I think it makes sense to apply all semantics of /u by
default within modules. Previously in this thread, I detailed what \d\w\b\s
mean in various regex flavors. The ones that give Unicode meanings by
default are .NET and Perl, so ES would be in excellent regex company.
Additionally, Java's \b (only) supports Unicode by default, as does ES's \s.
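
A short sketch of the current ES asymmetry between \s and the other
shorthands (no /u involved; the sample characters are my own picks):

```javascript
// Sketch only: shows the existing ES shorthand behavior without any flags.
const nbsp = "\u00A0";             // NO-BREAK SPACE
const emSpace = "\u2003";          // EM SPACE
const arabicIndicThree = "\u0663"; // ARABIC-INDIC DIGIT THREE
const eAcute = "\u00E9";           // LATIN SMALL LETTER E WITH ACUTE

// \s already covers Unicode white space.
const spaceResults = [/\s/.test(nbsp), /\s/.test(emSpace)];

// \d and \w are ASCII-only.
const digitResult = /\d/.test(arabicIndicThree);
const wordResult = /\w/.test(eAcute);

// \b follows \w, so it reports a word boundary inside "\u00E9z".
const boundaryInsideWord = /\bz/.test(eAcute + "z");

console.log(spaceResults);       // [ true, true ]
console.log(digitResult);        // false
console.log(wordResult);         // false
console.log(boundaryInsideWord); // true
```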
Norbert Lindenberg wrote:
In the other direction it's clear that using /u for \d\D\w\W\b\B has to
imply code point mode.
Not if their meaning was limited to the BMP, which is already true for
\D\W\B\S. /\D\D/.test(singleAstralNondigit) == true. With code point
matching it would be false. Yet another reason to tie the multiple proposed
meanings of /u together.
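
Spelled out as a sketch, assuming an engine where /u selects code point
matching (singleAstralNondigit is just a sample value):

```javascript
// Sketch only: assumes an engine where /u selects code point matching.
const singleAstralNondigit = "\u{1D306}"; // one code point, two code units

// Code unit mode: each surrogate half counts as its own non-digit.
const codeUnitMode = /\D\D/.test(singleAstralNondigit);

// Code point mode: there is only one non-digit to consume.
const codePointMode = /\D\D/u.test(singleAstralNondigit);
const singleNondigit = /^\D$/u.test(singleAstralNondigit);

console.log(codeUnitMode, codePointMode, singleNondigit); // true false true
```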
Norbert Lindenberg wrote:
Steven Levithan wrote:
3. [New proposal] Makes /i use Unicode casefolding rules.
/ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.
We probably should review the complete Unicode Technical Standard #18,
Unicode Regular Expressions, and see how we can upgrade RegExp for better
Unicode support. Maybe on a separate thread...
Agreed. You may already be thinking this, but IMO if we're going to add /u
as a Little Red Switch (as David called it), the priority should be on
making sure that /u gets all aspects of Unicode-aware regular expression
semantics done right, before looking at new features from UTS#18 like
Unicode property matching.
-- Steven Levithan