Re: Full Unicode based on UTF-16 proposal

Norbert Lindenberg Fri, 16 Mar 2012 22:18:30 -0700

On Mar 16, 2012, at 19:57 , Erik Corry wrote:

> 2012/3/17 Norbert Lindenberg <[email protected]>:
>> Thanks for your comments - a few replies below.
>> 
>> Norbert
>> 
>> 
>> On Mar 16, 2012, at 1:55 , Erik Corry wrote:
>> 
>>> However I think we probably do want the /u modifier on regexps to
>>> control the new backward-incompatible behaviour.  There may be some
>>> way to relax this for regexp literals in opted in Harmony code, but
>>> for new RegExp(...) and for other string literals I think there are
>>> rather too many inconsistencies with the old behaviour.
>> 
>> Before asking developers to add /u, we should really have some evidence that 
>> not doing so would cause actual compatibility issues for real applications. 
>> Do you know of any examples?
> 
> No.  In general I don't think it is realistic to try to prove that
> problematic code does not exist, since that requires quantifying over
> all existing JS code, which is clearly impossible.


We cannot prove its absence, but we can discuss the likelihood of its 
existence, and showing an actual example is a quick way to bring that 
discussion to a conclusion.

I note that you didn't challenge my claim about the (un)likelihood of the 
existence of applications that depend on Deseret characters not being mapped to 
lower case while calling toLowerCase...
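
To make the Deseret case concrete, here is a rough sketch contrasting the two views of a string. Both helper names are hypothetical and neither is part of the proposal; the second handles only Deseret capitals (U+10400..U+10427), purely for illustration.

```javascript
// Hypothetical helpers for illustration; neither is part of the proposal.

// Code-unit view: each 16-bit unit is lowercased on its own, so the two
// surrogates that encode a Deseret letter pass through unchanged.
function lowerByCodeUnit(str) {
  var out = "";
  for (var i = 0; i < str.length; i++) {
    out += str.charAt(i).toLowerCase();
  }
  return out;
}

// Code-point view, restricted to Deseret for brevity: capitals
// U+10400..U+10427 map to small letters U+10428..U+1044F (offset 0x28).
// Everything else, including ASCII, is copied through untouched.
function lowerDeseretByCodePoint(str) {
  var out = "";
  for (var i = 0; i < str.length; i++) {
    var lead = str.charCodeAt(i);
    var trail = i + 1 < str.length ? str.charCodeAt(i + 1) : NaN;
    if (lead === 0xD801 && trail >= 0xDC00 && trail <= 0xDC27) {
      out += String.fromCharCode(lead, trail + 0x28);
      i++; // consumed the trail surrogate as well
    } else {
      out += str.charAt(i);
    }
  }
  return out;
}
```

An application would only break if it relied on the first behavior for such characters, i.e. if it called toLowerCase and expected a Deseret capital to come back unchanged.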

>>> The algorithm given for codePointAt never returns NaN.  It should
>>> probably do that for indices that hit a trail surrogate that has a
>>> lead surrogate preceding it.
>> 
>> NaN is not a valid code point, so it shouldn't be returned. If we want to 
>> indicate access to a trailing surrogate code unit as an error, we should 
>> throw an exception.
> 
> Then you should probably remove the text: "If there is no code unit at
> that position, the result is NaN" from your proposal :-)
> 
> I am wary of using exceptions for non-exceptional data-driven events,
> since performance is usually terrible and it's arguably an abuse of
> the mechanism.  Your iterator code looks fine to me and needs neither
> NaN nor exceptions.

The iterator or codePointAt?

The latter has the statement you quote, which shows a disconnect between what I 
wrote a few days ago starting from the charCodeAt spec, and what I think when I 
don't look at that spec. charCodeAt (and hence the current implementation of 
codePointAt) returns NaN when given an index < 0 or ≥ length. The normal 
behavior when accessing elements or properties that don't exist is to return 
undefined. We can't fix charCodeAt anymore, but I can still fix codePointAt.
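
As a minimal sketch of the behavior described here, and not the spec text of the proposal: the function name codePointAtSketch is hypothetical, the index is assumed to be an integer, and returning an unpaired or trailing surrogate as-is is an assumption of this sketch.

```javascript
// Hypothetical helper, not the proposal's spec text.
function codePointAtSketch(str, index) {
  var length = str.length;
  // Out-of-range index: return undefined rather than NaN.
  if (index < 0 || index >= length) {
    return undefined;
  }
  var first = str.charCodeAt(index);
  // Not a lead surrogate, or no following code unit: return the unit itself.
  if (first < 0xD800 || first > 0xDBFF || index + 1 === length) {
    return first;
  }
  var second = str.charCodeAt(index + 1);
  // Lead surrogate not followed by a trail surrogate: return the lead as-is.
  if (second < 0xDC00 || second > 0xDFFF) {
    return first;
  }
  // Well-formed pair: combine into a supplementary code point.
  return (first - 0xD800) * 0x400 + (second - 0xDC00) + 0x10000;
}

// Walking a string by code point needs neither NaN nor exceptions:
// advance by two code units after a supplementary code point, else by one.
function codePointsSketch(str) {
  var result = [];
  var i = 0;
  while (i < str.length) {
    var cp = codePointAtSketch(str, i);
    result.push(cp);
    i += cp > 0xFFFF ? 2 : 1;
  }
  return result;
}
```

With this shape, an index that lands on a trailing surrogate simply yields that code unit, and only a position with no code unit at all yields undefined.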

>>> Perhaps it is outside the scope of this proposal, but it would also
>>> make a lot of sense to add some named character classes to RegExp.
>> 
>> It would make a lot of sense, but is outside the scope of this proposal. One 
>> step at a time :-)
> 
> I can see that.  But if we are going to have multiple versions of the
> RegExp syntax we should probably aim to keep the number down.

True. And in the meantime Brendan pointed to some regex proposals that try to 
address a different set of Unicode-related issues, also with a /u flag. Some 
coordination is clearly needed.
http://blog.stevenlevithan.com/archives/fixing-javascript-regexp


_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
