Re: Full Unicode based on UTF-16 proposal

Norbert Lindenberg Mon, 26 Mar 2012 20:20:25 -0700

On Mar 26, 2012, at 9:45, Steven Levithan wrote:

> Sorry for jumping between messages...
> 
> Roger Andrews wrote:
>> Doesn't C/C++ allow non-BMP code points using \Uxxxxxxxx in character 
>> literals?  The \U format expresses a full 32-bit code, which could be mapped 
>> internally to two 16-bit UTF-16 codes.
> 
> Python uses the same syntax in regular expressions. But, as Norbert noted, 
> there is already a strawman for \u{X..}. If it were adopted, I think it is 
> clear that it should also be extended to RegExp literals. Of course, this 
> adds some complication when referencing numbers above FFFF unless /u is made 
> the default everywhere, since it implies code-point-based matching. E.g., 
> what does /[^\0-\uFFFF\u{10000}]/ without /u match?


As long as the underlying system is UTF-16 based, I'd think \u{10000} is simply 
a different notation for \uD800\uDC00. But with code unit based matching, the 
character class then contains two separate code units rather than one code 
point, so the expression will not behave as intended.
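
A minimal sketch of that point, assuming an engine that accepts the strawman \u{X..} escape in string literals. The variable names are only for illustration. It shows that both notations produce the same two code units, and that a character class without /u treats those code units as two independent members:

```javascript
// Assumption: the engine supports the \u{X..} strawman escape in strings.
const viaCodePoint = "\u{10000}";
const viaSurrogates = "\uD800\uDC00";

// Both notations yield the same UTF-16 sequence of two code units.
const sameString = viaCodePoint === viaSurrogates;
const codeUnitCount = viaCodePoint.length;

// Code unit based matching (no /u): the class holds two separate members,
// U+D800 and U+DC00, so a lone high surrogate satisfies it...
const loneSurrogateMatches = /^[\uD800\uDC00]$/.test("\uD800");
// ...while the complete character is two code units long and cannot be
// consumed by a single class match.
const wholeCharMatches = /^[\uD800\uDC00]$/.test(viaSurrogates);

console.log(sameString, codeUnitCount, loneSurrogateMatches, wholeCharMatches);
```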

> The example above also hints at additional potentially breaking changes for 
> code point matching by default that haven't yet been discussed in this 
> thread: that the meaning of negated character classes and shorthands would 
> change, and that their match length may be 2 (like the dot).

Yes.
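
To make the change concrete, here is a small sketch that assumes a /u flag with the code point semantics discussed in this thread. The names are hypothetical. It compares match lengths for the dot, a negated class, and a negated shorthand in both modes:

```javascript
// Assumption: /u switches the regular expression to code point matching.
const astral = "\uD834\uDF06"; // U+1D306: one code point, two code units

// Dot: consumes one code unit without /u, one code point with /u.
const dotUnitLength = /./.exec(astral)[0].length;
const dotPointLength = /./u.exec(astral)[0].length;

// Negated class: the same change in match length.
const negUnitLength = /[^a]/.exec(astral)[0].length;
const negPointLength = /[^a]/u.exec(astral)[0].length;

// Negated shorthand: \D\D matches the two surrogates in code unit mode,
// but a single supplementary character is only one non-digit with /u.
const twoNondigitsUnit = /^\D\D$/.test(astral);
const twoNondigitsPoint = /^\D\D$/u.test(astral);

console.log(dotUnitLength, dotPointLength, negUnitLength, negPointLength);
console.log(twoNondigitsUnit, twoNondigitsPoint);
```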

> Roger Andrews wrote:
>> I'm struggling with how non-BMP escapes would be used in practice in strings
>> & regexps -- especially regexp ranges.  Will Strawman style be used in 
>> string & regexp literals?
>> [...examples snipped]
> 
> I'm not sure whether this was already clear, but the curly braces I included 
> in my paraphrasing of Norbert's proposed transformations were not meant to be 
> included literally. I was trying to describe ranges between arbitrary code 
> points, represented by pairs of high and low surrogates. As far as I 
> understand, no existing proposal would allow a character class range written 
> as /[{\uD834\uDF07}-{\uD834\uDF56}]/ to work correctly, with or without the 
> curly braces. To match a range outside the BMP in a literal RegExp, you would 
> have to use [<char>-<char>] (where <char> represents a literal character, and 
> this may require flag /u to work) or [\u{X..}-\u{X..}] (assuming support for 
> this syntax is added, and where X.. represents a hex number between 0 and at 
> least 10FFFF).

In my proposal the following two regular expressions are equivalent:

/[𝌆-𝍖]+/u
/[\uD834\uDF06-\uD834\uDF56]+/u

They are made equivalent by the first preprocessing step proposed for 15.10.4.1 
and the subsequent interpretation of UTF-16 sequences as code points.

I think I'd process Unicode code point escapes by first converting them to 
equivalent code unit escapes and then following the same path. This would make

/[\u{1D306}-\u{1D356}]+/u

equivalent to the two above.
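
A quick way to check the equivalence, as a sketch only: it assumes an engine that implements both the /u flag and the \u{X..} strawman escape with the semantics described above. The variable names are made up for the example. The last part shows what happens to the same range in code unit mode, where the class is read as U+D834, the range U+DF06 to U+D834, and U+DF56:

```javascript
// Assumption: the engine supports /u and \u{X..} as proposed in this thread.
const literalRange = /[𝌆-𝍖]+/u;
const surrogateRange = /[\uD834\uDF06-\uD834\uDF56]+/u;
const codePointRange = /[\u{1D306}-\u{1D356}]+/u;

const input = "𝌆𝌇𝌈𝌉𝌊";

// Each expression should consume the whole input as one run of code points.
const results = [literalRange, surrogateRange, codePointRange].map(
  (re) => re.exec(input)[0] === input
);

// Without /u the same pattern is interpreted as code units. The range then
// runs from U+DF06 down to U+D834, which is out of order, so constructing
// the expression is rejected.
let codeUnitModeError = null;
try {
  new RegExp("[\uD834\uDF06-\uD834\uDF56]+");
} catch (e) {
  codeUnitModeError = e;
}

console.log(results, codeUnitModeError instanceof SyntaxError);
```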

> Norbert Lindenberg wrote:
>> [...snip] My first proposal was to make this happen even without a new
>> flag, i.e., make
>> "𝌆𝌇𝌈𝌉𝌊".match(/[𝌆-𝍖]+/)
>> work based on code points, and Steve is arguing against that because of
>> compatibility risk. My proposal also includes some transformations to keep
>> existing regular expressions working, and Steve correctly observes that if
>> we have a flag for code point mode, then the transformation is not needed -
>> old regular expressions would continue to work in code unit mode, while new
>> regular expressions with /u get code point treatment.
> 
> Although I've argued the compatibility risk angle, on that point I should 
> defer to implementers and others who might have a better sense of the scope 
> of risk/damage to existing programs. More personally affecting, though, is 
> the negative gut reaction I have to the well-thought-out but ugly and 
> complicated (not so much in implementation, but for devs who have to learn 
> about it) transformations that would otherwise be necessary to avoid breaking 
> current regexes. And like David, I think just requiring /u is not so bad, 
> especially since I'd want to use it for its other meanings anyway.
> 
> I'm also nervous about using different default semantics inside and out of 
> ES6 modules, but David Herman has already well articulated my concerns and 
> you've already responded, so I'll leave that discussion to you two except to 
> say that if /u is added as a general "fix Unicode" flag, IMO it's reasonable 
> to automatically turn it on in ES6 modules. That's because I think applying 
> only code unit to code point mode switching in modules by default is too 
> magical and confusing, but if it were described as turning on /u by default, 
> that's easy to understand and explain.

Good input.

> Lasse Reichstein wrote:
>> Steven Levithan wrote:
>>> I've been wondering whether it might be best for the /u flag to do three
>>> things at once, making it an all-around "support Unicode better" flag:
>>> [...]
>>> 3. [New proposal] Makes /i use Unicode casefolding rules.
>> 
>> Yey, I'm for it :)
>> Especially if it means dropping the rather naïve canonicalize function
>> that can't canonicalize an ASCII character with a non-ASCII character.
> 
> That would be my hope as well.
> 
> Norbert Lindenberg wrote:
>> One concern: I think code point based matching should be the default for
>> regex literals within modules (where we know the code is written for
>> Harmony). Does it make sense to also interpret \d\D\w\W\b\B as full
>> Unicode sets for such literals?
> 
> As I said above, yes, I think it makes sense to apply all semantics of /u by 
> default within modules. Previously in this thread, I detailed what \d\w\b\s 
> mean in various regex flavors. The ones that give Unicode meanings by default 
> are .NET and Perl, so ES would be in excellent regex company. Additionally, 
> Java's \b (only) supports Unicode by default, as does ES's \s.
> 
> Norbert Lindenberg wrote:
>> In the other direction it's clear that using /u for \d\D\w\W\b\B has to
>> imply code point mode.
> 
> Not if their meaning was limited to the BMP, which is already true for 
> \D\W\B\S. /\D\D/.test(singleAstralNondigit) == true. With code point matching 
> it would be false. Yet another reason to tie the multiple proposed meanings 
> of /u together.
> 
> Norbert Lindenberg wrote:
>> Steven Levithan wrote:
>>> 3. [New proposal] Makes /i use Unicode casefolding rules.
>>> /ΣΤΙΓΜΑΣ/iu.test("στιγμας") == true.
>> 
>> We probably should review the complete Unicode Technical Standard #18,
>> Unicode Regular Expressions, and see how we can upgrade RegExp for better
>> Unicode support. Maybe on a separate thread...
> 
> Agreed. You may already be thinking this, but IMO if we're going to add /u as 
> a Little Red Switch (as David called it), the priority should be on making 
> sure that /u gets all aspects of Unicode-aware regular expression semantics 
> done right, before looking at new features from UTS#18 like Unicode property 
> matching.
> 
> -- Steven Levithan 
