Re: Full Unicode based on UTF-16 proposal

Norbert Lindenberg Mon, 26 Mar 2012 21:13:00 -0700

On Mar 26, 2012, at 13:02 , Gavin Barraclough wrote:

> Hi Norbert,
> 
> I really like the direction you're going in, but have one minor concern 
> relating to regular expressions.
> 
> In your proposal, you currently state:
>       "A code unit that is in the range 0xD800 to 0xDFFF, but is not part of 
> a surrogate pair, is interpreted as a code point with the same value."
> 
> I think this makes sense in the context of your original proposal, which 
> seeks to be backwards compatible with existing regular expressions through 
> the range transformations.  But I'm concerned that this might prove 
> problematic, and would suggest that if we're going to make unicode regexp 
> match opt-in through a /u flag then instead it may be better to make unpaired 
> surrogates in unicode expressions a syntax error.


That's worth considering. It seems we're moving more and more towards two 
separate RegExp versions anyway: a legacy version based on code units, with 
all kinds of quirks, and an all-around-better version based on code points. 
Making unpaired surrogates a syntax error would mean, however, that you 
couldn't easily replace unpaired surrogates with
   str.replace(/[\u{D800}-\u{DFFF}]/ug, "\u{FFFD}")
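
To spell out what that one-liner is meant to do, here is a minimal sketch. It 
assumes the code point semantics quoted from the proposal (a lone surrogate is 
a code point with the same value), which as far as I know is also how the u 
flag behaves in ECMAScript 2015 engines. The helper name 
replaceLoneSurrogates is made up for illustration.

```javascript
// Hypothetical helper: replace every unpaired surrogate with U+FFFD.
// With the u flag the input is read as code points, so a well-formed
// surrogate pair counts as one supplementary code point and is not
// inside the range D800-DFFF.
function replaceLoneSurrogates(str) {
  return str.replace(/[\u{D800}-\u{DFFF}]/ug, "\u{FFFD}");
}

const loneLead = replaceLoneSurrogates("a\uD800b");      // lone lead surrogate
const loneTrail = replaceLoneSurrogates("a\uDC00b");     // lone trail surrogate
const wellFormed = replaceLoneSurrogates("\uD800\uDC00"); // valid pair
console.log(loneLead === "a\uFFFDb", loneTrail === "a\uFFFDb",
            wellFormed === "\uD800\uDC00");
```

Without the u flag the same character class works on code units, so it would 
also replace both halves of a valid pair.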

> My concern would be expressions such as:
>       /[\uD800\uDC00\uDC00\uD800]/u
> Under my reading of the current proposal, this could match any of 
> "\uD800\uDC00", "\uD800", or "\uDC00".  Allowing this seems to introduce the 
> concept of precedence to character classes (given an input "\uD800\uDC00", 
> should I choose to match "\uD800\uDC00" or "\uD800"?).  It may also 
> significantly complicate the implementation of backtracking if we were to 
> allow this (if I have matched "\uD800\uDC00", should I step back by one code 
> unit or two?).

I think/hope that my specification is clear: a surrogate pair is always treated 
as one entity, not as two pieces. If the input is "\uD800\uDC00", you match 
"\uD800\uDC00". If you have to backtrack over "\uD800\uDC00", you step back two 
code units.
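
A small sketch of that rule, using the character class from the quoted 
example. It assumes the pair-as-one-entity semantics described above, which 
to my knowledge match what the u flag does in ECMAScript 2015 engines. The 
variable names are only for illustration.

```javascript
// The class contains three code points under the u flag:
// U+10000 (the pair \uD800\uDC00), lone U+DC00, and lone U+D800.
const cls = /[\uD800\uDC00\uDC00\uD800]/u;

// Input is a well-formed pair: the whole pair is matched as one entity.
const pairMatch = cls.exec("\uD800\uDC00");

// Input contains only unpaired surrogates: each is matched by itself.
const leadMatch = cls.exec("\uD800");
const trailMatch = cls.exec("x\uDC00");

// Backtracking never lands inside a pair: "." consumes both code units,
// so there is nothing left for the lone \uDC00 to match.
const splitsPairUnicode = /^.\uDC00/u.test("\uD800\uDC00");

// The legacy code unit version does split the pair.
const splitsPairLegacy = /^.\uDC00/.test("\uD800\uDC00");

console.log(pairMatch[0].length, leadMatch[0].length, trailMatch.index,
            splitsPairUnicode, splitsPairLegacy);
```

So there is no precedence question inside the class: which member matches is 
decided by how the input is divided into code points, before the class is 
consulted.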

> It also just seems much clearer from a user perspective to say that 
> non-unicode regular expressions match code units, unicode regular expressions 
> match code points - mixing the two seems unhelpful.
> 
> If opt-in is automatic in modules, programmers will likely want an escape to 
> be able to write non-unicode regular expressions, but I don't think this 
> should warrant an extra flag, I don't think we can automatically change the 
> behaviour of the RegExp constructor (without a "u" flag being passed), so 
> RegExp("\uD800") should still be available to support non-unicode matching 
> within modules.

Agreed, especially after reading Erik's and your additional emails on this.
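
For illustration, a sketch of the difference between the two constructor 
forms, assuming the u flag is passed as the second argument the way it is in 
ECMAScript 2015 (the variable names are made up):

```javascript
const input = "\uD800\uDC00"; // one supplementary code point, two code units

// No flag: legacy matching on code units, so the lead surrogate is found
// even though it is the first half of a pair.
const legacy = new RegExp("\uD800");

// "u" flag: matching on code points, so a lone U+D800 in the pattern does
// not match half of a well-formed pair.
const unicode = new RegExp("\uD800", "u");

const legacyHit = legacy.test(input);
const unicodeHit = unicode.test(input);
const unicodeLoneHit = unicode.test("\uD800"); // an actual unpaired surrogate
console.log(legacyHit, unicodeHit, unicodeLoneHit);
```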

