Re: Full Unicode based on UTF-16 proposal

Steven Levithan Mon, 26 Mar 2012 09:45:43 -0700

Sorry for jumping between messages...

Roger Andrews wrote:

Doesn't C/C++ allow non-BMP code points using \Uxxxxxxxx in character literals? The \U format expresses a full 32-bit code, which could be mapped internally to two 16-bit UTF-16 codes.

Python uses the same syntax in regular expressions. But, as Norbert noted, there is already a strawman for \u{X..}. If it were adopted, I think it is clear that it should also be extended to RegExp literals. Of course, this adds some complication when referencing numbers above FFFF unless /u is made the default everywhere, since it implies code-point-based matching. E.g., what does /[^\0-\uFFFF\u{10000}]/ without /u match?
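To make the question concrete, here is a small sketch. It assumes an engine that implements /u and \u{...} the way ES2015 later specified them, which is not something this thread had settled; the variable names are only illustrative.

```javascript
// Same pattern source compiled twice; only the flag differs.
// Assumption: the engine follows the later ES2015 rules for /u and \u{...}.
const source = "[^\\0-\\uFFFF\\u{10000}]";
const codeUnitMode = new RegExp(source);       // no /u
const codePointMode = new RegExp(source, "u"); // with /u

const u10000 = "\u{10000}";
const u10001 = "\u{10001}";

// Without /u, "\u{10000}" inside the class is just the literal characters
// u, {, 1, 0 and }, and the class already excludes every code unit,
// so nothing can match. With /u the class excludes the BMP plus U+10000.
const results = {
  codeUnitMode: [codeUnitMode.test(u10000), codeUnitMode.test(u10001)],
  codePointMode: [codePointMode.test(u10000), codePointMode.test(u10001)],
};
console.log(results);
```

Under those assumptions the pattern without /u can never match anything, while the /u version matches any astral code point other than U+10000.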

The example above also hints at additional potentially breaking changes for code point matching by default that haven't yet been discussed in this thread: that the meaning of negated character classes and shorthands would change, and that their match length may be 2 (like the dot).
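A rough illustration of the match-length point, again assuming /u behaves as ES2015 later defined it (the names below are made up for the example):

```javascript
// One astral code point, stored as two UTF-16 code units.
const astral = "\u{1D306}";

// Length of the first match, measured in code units.
const lengths = {
  dotCodeUnit: astral.match(/./)[0].length,
  dotCodePoint: astral.match(/./u)[0].length,
  negatedCodeUnit: astral.match(/[^a]/)[0].length,
  negatedCodePoint: astral.match(/[^a]/u)[0].length,
  shorthandCodeUnit: astral.match(/\S/)[0].length,
  shorthandCodePoint: astral.match(/\S/u)[0].length,
};
console.log(lengths);
```

In code unit mode each construct consumes only the high surrogate; in code point mode each consumes the whole pair, so any code that assumes a single-construct match has length 1 would see different results.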


Roger Andrews wrote:

I'm struggling with how non-BMP escapes would be used in practice in strings & regexps -- especially regexp ranges. Will Strawman style be used in string & regexp literals?
[...examples snipped]

I'm not sure whether this was already clear, but the curly braces I included in my paraphrasing of Norbert's proposed transformations were not meant to be included literally. I was trying to describe ranges between arbitrary code points, represented by pairs of high and low surrogates. As far as I understand, no existing proposal would allow a character class range written as /[{\uD834\uDF07}-{\uD834\uDF56}]/ to work correctly, with or without the curly braces. To match a range outside the BMP in a literal RegExp, you would have to use [<char>-<char>] (where <char> represents a literal character, and this may require flag /u to work) or [\u{X..}-\u{X..}] (assuming support for this syntax is added, and where X.. represents a hex number between 0 and at least 10FFFF).
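For reference, a sketch of how such a range behaves in an engine that follows the rules ES2015 eventually adopted. This is an assumption about later behavior, not something the proposals in this thread guaranteed, and compile() is a hypothetical helper.

```javascript
// Hypothetical helper: return the RegExp, or the error thrown while compiling.
function compile(source, flags) {
  try {
    return new RegExp(source, flags);
  } catch (err) {
    return err;
  }
}

// Code unit mode: the class is read as \uD834, then the range \uDF07-\uD834,
// which is out of order, so compilation fails.
const legacy = compile("[\\uD834\\uDF07-\\uD834\\uDF56]", "");

// Code point mode (assumed ES2015 /u semantics): both spellings describe
// the range U+1D307..U+1D356.
const withBraces = compile("[\\u{1D307}-\\u{1D356}]", "u");
const withPairs = compile("[\\uD834\\uDF07-\\uD834\\uDF56]", "u");

console.log(legacy instanceof SyntaxError);
console.log(withBraces.test("\u{1D308}"), withBraces.test("\u{1D306}"));
console.log(withPairs.test("\u{1D308}"), withPairs.test("\u{1D306}"));
```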


Norbert Lindenberg wrote:

[...snip] My first proposal was to make this happen even without a new
flag, i.e., make
"𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/)
work based on code points, and Steve is arguing against that because of
compatibility risk. My proposal also includes some transformations to keep
existing regular expressions working, and Steve correctly observes that if
we have a flag for code point mode, then the transformation is not needed -
old regular expressions would continue to work in code unit mode, while new
regular expressions with /u get code point treatment.

Although I've argued the compatibility risk angle, on that point I should defer to implementers and others who might have a better sense of the scope of risk/damage to existing programs. More personally affecting, though, is the negative gut reaction I have to the well-thought-out but ugly and complicated (not so much in implementation, but for devs who have to learn about it) transformations that would otherwise be necessary to avoid breaking current regexes. And like David, I think just requiring /u is not so bad, especially since I'd want to use it for its other meanings anyway.

I'm also nervous about using different default semantics inside and out of ES6 modules, but David Herman has already well articulated my concerns and you've already responded, so I'll leave that discussion to you two except to say that if /u is added as a general "fix Unicode" flag, IMO it's reasonable to automatically turn it on in ES6 modules. That's because I think applying only code unit to code point mode switching in modules by default is too magical and confusing, but if it were described as turning on /u by default, that's easy to understand and explain.


Lasse Reichstein wrote:

Steven Levithan wrote:

I've been wondering whether it might be best for the /u flag to do three
things at once, making it an all-around "support Unicode better" flag:
[...]
3. [New proposal] Makes /i use Unicode casefolding rules.


Yey, I'm for it :)
Especially if it means dropping the rather naïve canonicalize function
that can't canonicalize an ASCII character with a non-ASCII character.


That would be my hope as well.
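A short sketch of the canonicalization limitation Lasse mentions, compared with what Unicode case folding gives. The /iu results assume the simple case folding that ES2015 later specified for /u; the test characters are my own choice.

```javascript
const kelvin = "\u212A"; // KELVIN SIGN, case-folds to ASCII "k"
const longS = "\u017F";  // LATIN SMALL LETTER LONG S, case-folds to ASCII "s"

const checks = {
  // Existing /i: a non-ASCII character is never canonicalized to an ASCII one.
  kelvinLegacy: /k/i.test(kelvin),
  longSLegacy: /s/i.test(longS),
  // /iu with Unicode case folding (assumed ES2015 semantics).
  kelvinFolded: /k/iu.test(kelvin),
  longSFolded: /s/iu.test(longS),
};
console.log(checks);
```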

Norbert Lindenberg wrote:

One concern: I think code point based matching should be the default for
regex literals within modules (where we know the code is written for
Harmony). Does it make sense to also interpret \d\D\w\W\b\B as full
Unicode sets for such literals?

As I said above, yes, I think it makes sense to apply all semantics of /u by default within modules. Previously in this thread, I detailed what \d\w\b\s mean in various regex flavors. The ones that give Unicode meanings by default are .NET and Perl, so ES would be in excellent regex company. Additionally, Java's \b (only) supports Unicode by default, as does ES's \s.
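As a quick check of the current ES baseline described here (no flags involved; the sample characters are my own picks):

```javascript
const nbsp = "\u00A0";        // NO-BREAK SPACE
const arabicThree = "\u0663"; // ARABIC-INDIC DIGIT THREE
const eAcute = "\u00E9";      // LATIN SMALL LETTER E WITH ACUTE

const baseline = {
  whitespace: /\s/.test(nbsp),   // \s already covers Unicode whitespace
  digit: /\d/.test(arabicThree), // \d is ASCII 0-9 only
  wordChar: /\w/.test(eAcute),   // \w is ASCII [A-Za-z0-9_] only
  boundary: /\b/.test(eAcute),   // \b is defined via \w, so no boundary here
};
console.log(baseline);
```

Only \s reports true; the other three shorthands ignore non-ASCII characters.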


Norbert Lindenberg wrote:

In the other direction it's clear that using /u for \d\D\w\W\b\B has to
imply code point mode.

Not if their meaning was limited to the BMP, which is already true for \D\W\B\S. /\D\D/.test(singleAstralNondigit) == true. With code point matching it would be false. Yet another reason to tie the multiple proposed meanings of /u together.
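Spelled out as a runnable check, with the code point result assuming /u works as ES2015 later defined it (the sample character is an arbitrary astral non-digit):

```javascript
// One astral non-digit: a single code point made of two code units.
const singleAstralNondigit = "\u{1D306}";

// Code unit mode: each surrogate half counts as its own non-digit.
const codeUnitResult = /\D\D/.test(singleAstralNondigit);

// Code point mode (assumed ES2015 /u semantics): the pair is one non-digit,
// so there is nothing left for the second \D.
const codePointResult = /\D\D/u.test(singleAstralNondigit);

console.log(codeUnitResult, codePointResult);
```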


Norbert Lindenberg wrote:

Steven Levithan wrote:

3. [New proposal] Makes /i use Unicode casefolding rules.
/ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.


We probably should review the complete Unicode Technical Standard #18,
Unicode Regular Expressions, and see how we can upgrade RegExp for better
Unicode support. Maybe on a separate thread...

Agreed. You may already be thinking this, but IMO if we're going to add /u as a Little Red Switch (as David called it), the priority should be on making sure that /u gets all aspects of Unicode-aware regular expression semantics done right, before looking at new features from UTS#18 like Unicode property matching.

-- Steven Levithan

_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
