Re: Full Unicode based on UTF-16 proposal

Erik Corry Mon, 26 Mar 2012 14:13:09 -0700

2012/3/26 Gavin Barraclough:
> Hi Norbert,
>
> I really like the direction you're going in, but have one minor concern 
> relating to regular expressions.
>
> In your proposal, you currently state:
>        "A code unit that is in the range 0xD800 to 0xDFFF, but is not part of 
> a surrogate pair, is interpreted as a code point with the same value."
>
> I think this makes sense in the context of your original proposal, which 
> seeks to be backwards compatible with existing regular expressions through 
> the range transformations.  But I'm concerned that this might prove 
> problematic, and would suggest that if we're going to make unicode regexp
> matching opt-in through a /u flag, then it may be better to make unpaired
> surrogates in unicode expressions a syntax error.
>
> My concern would be expressions such as:
>        /[\uD800\uDC00\uDC00\uD800]/u
> Under my reading of the current proposal, this could match any of 
> "\uD800\uDC00", "\uD800", or "\uDC00".  Allowing this seems to introduce the 
> concept of precedence to character classes (given an input "\uD800\uDC00", 
> should I choose to match "\uD800\uDC00" or "\uD800"?).  It may also 
> significantly complicate the implementation of backtracking if we were to 
> allow this (if I have matched "\uD800\uDC00", should I step back by one code 
> unit or two?).
>
> It also just seems much clearer from a user perspective to say that 
> non-unicode regular expressions match code units, unicode regular expressions 
> match code points - mixing the two seems unhelpful.
>
> If opt-in is automatic in modules, programmers will likely want an escape
> hatch for writing non-unicode regular expressions, but I don't think this
> warrants an extra flag.  I don't think we can automatically change the
> behaviour of the RegExp constructor (without a "u" flag being passed), so
> RegExp("\uD800") should still be available to support non-unicode matching
> within modules.


This is too nasty.  The regexp constructor should not have to look up
the stack to see what behaviour is expected of it.
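
To make the code-unit versus code-point distinction in this thread concrete,
here is a minimal sketch. It assumes an engine that implements the u flag as
it was later standardized in ECMAScript 2015, which is not necessarily the
behaviour of the proposal under discussion. The variable names are
illustrative only.

```javascript
// The input from the quoted example: one supplementary character, U+10000,
// stored as the surrogate pair D800 DC00 (two UTF-16 code units).
const pair = "\uD800\uDC00";

// Without the u flag the pattern is matched code unit by code unit,
// so a lone high surrogate matches the first half of the pair.
const unitMatch = /\uD800/.exec(pair);

// With the u flag the input is read as code points, so a lone
// surrogate in the pattern does not match half of a pair.
const pointMatch = /\uD800/u.exec(pair);

// The character class from the thread. In u mode, \uD800\uDC00 is read as
// the single code point U+10000, followed by lone DC00 and lone D800.
const classMatch = /[\uD800\uDC00\uDC00\uD800]/u.exec(pair);

// The constructor's mode depends only on the flags argument it is given,
// not on the context it is called from.
const byUnit = new RegExp("^.$");
const byPoint = new RegExp("^.$", "u");

console.log(unitMatch[0].length);                   // 1
console.log(pointMatch);                            // null
console.log(classMatch[0].length);                  // 2
console.log(byUnit.test(pair), byPoint.test(pair)); // false true
```

ES2015 did not adopt the syntax-error suggestion: lone surrogates remain
legal in u-mode patterns, but they never match half of a pair, so the
precedence and backtracking questions raised in the quoted message do not
arise.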

-- 
Erik Corry

>
> cheers,
> G.
>
>
> On Mar 16, 2012, at 12:18 AM, Norbert Lindenberg wrote:
>
>> Based on my prioritization of goals for support for full Unicode in 
>> ECMAScript [1], I've put together a proposal for supporting the full Unicode 
>> character set based on the existing representation of text in ECMAScript 
>> using UTF-16 code unit sequences:
>> http://norbertlindenberg.com/2012/03/ecmascript-supplementary-characters/index.html
>>
>> The detailed proposed spec changes serve to get a good idea of the scope of 
>> the changes, but will need some polishing.
>>
>> Comments?
>>
>> Thanks,
>> Norbert
>>
>> [1] https://mail.mozilla.org/pipermail/es-discuss/2012-February/020721.html
>>
>> _______________________________________________
>> es-discuss mailing list
>> [email protected]
>> https://mail.mozilla.org/listinfo/es-discuss