Re: Q: Lonely surrogates and unicode regexps

Norbert Lindenberg Sat, 31 Jan 2015 00:02:05 -0800

> On Jan 28, 2015, at 8:11 , Allen Wirfs-Brock <[email protected]> wrote:
> 
> 
> On Jan 28, 2015, at 2:36 AM, Marja Hölttä <[email protected]> wrote:
> 
>> Hello es-discuss,
>> 
>> TL;DR: /foo.bar/u.test("foo\uD83Dbar") == ?
>> 
>> The ES6 unicode regexp spec is not very clear regarding what should happen 
>> if the regexp or the matched string contains lonely surrogates (a lead 
>> surrogate without a trail, or a trail without a lead). For example, for the 
>> . operator, the relevant parts of the spec speak about characters:
> 
> TL;DR: in a unicode regexp lonely surrogates are considered to be a single 
> “character”. 
> 
> As André has already covered, “character” has a very specific meaning within 
> the context of the ES6 RegExp specification; see the second paragraph of 
> http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics . 
> The specification uses the same set of algorithms to describe both BMP (i.e., 
> 16-bit elements) and unicode (i.e., 32-bit elements) patterns and matching 
> semantics. “Character” is used in those algorithms to refer to a single 
> element in whichever mode is currently in effect.
> 
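
To make the two modes concrete, here is a small sketch of how "." behaves with and without the u flag. It assumes an engine that implements the ES6 u flag; the variable names are only for illustration.

```javascript
// U+1F600 is stored as two UTF-16 code units (a valid surrogate pair).
const smiley = "\uD83D\uDE00";
// A lead surrogate with no trail surrogate after it.
const loneLead = "\uD83D";

const results = {
  // Without /u a "character" is one code unit, so the pair needs two dots.
  pairOneDotNoU: /^.$/.test(smiley),
  pairTwoDotsNoU: /^..$/.test(smiley),
  // With /u a "character" is one code point, so the pair is a single match.
  pairOneDotU: /^.$/u.test(smiley),
  // A lonely surrogate is also a single "character" in /u mode.
  loneOneDotU: /^.$/u.test(loneLead),
  // The example from the original question.
  original: /foo.bar/u.test("foo\uD83Dbar"),
};

console.log(results);
```

Under the intended semantics only pairOneDotNoU is false; every other entry is true.
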
> I think the ambiguity you find is in step 2.1 of 
> http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern :
> 
> 2. Return an internal closure that takes two arguments, a String str and an 
> integer index, and performs the following:
>    1. If Unicode is true, let Input be a List consisting of the sequence of 
>       code points of str interpreted as a UTF-16 encoded Unicode string. 
>       Otherwise, let Input be a List consisting of the sequence of code units 
>       that are the elements of str. Input will be used throughout the 
>       algorithms in 21.2.2. Each element of Input is considered to be a 
>       character.
> 
> Apparently I don’t have an adequate definition of “interpreted as a UTF-16 
> encoded Unicode string”. If you submit a bug to bugs.ecmascript.org I will 
> provide one in the next spec revision. The intended semantics is that:
>    In ascending string index order:
>       Each valid UTF-16 surrogate pair is interpreted as a single code point 
>       that is the UTF-16 encoded value.
>       Each “lonely” surrogate is interpreted as a single code point that is 
>       the surrogate value.
>       Every other 16-bit code unit is interpreted as a single code point.
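
The three rules quoted above can be sketched as a small decoder. This is only an illustration of the intended semantics, not spec text; toCodePoints is a hypothetical helper name, and the sample string is made up for the example.

```javascript
// Hypothetical helper: builds the list of code points described above.
function toCodePoints(str) {
  const input = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const isLead = unit >= 0xD800 && unit <= 0xDBFF;
    if (isLead && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // Rule 1: a valid surrogate pair becomes the code point it encodes.
        input.push((unit - 0xD800) * 0x400 + (next - 0xDC00) + 0x10000);
        i++; // skip the trail surrogate that was just consumed
        continue;
      }
    }
    // Rule 2: a lonely surrogate is kept as its own value.
    // Rule 3: any other 16-bit code unit is a code point by itself.
    input.push(unit);
  }
  return input;
}

// "a", a valid pair (U+1F600), a lonely lead, "b", a lonely trail.
const sample = "a\uD83D\uDE00\uD83Db\uDC00";
const codePoints = toCodePoints(sample);
console.log(codePoints.map((cp) => cp.toString(16)));

// The built-in string iterator splits strings the same way.
const viaIterator = Array.from(sample, (ch) => ch.codePointAt(0));
```

For the sample string both codePoints and viaIterator contain 0x61, 0x1F600, 0xD83D, 0x62, 0xDC00, in that order.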


That definition is in section 6.1.4:
http://people.mozilla.org/~jorendorff/es6-draft.html#sec-ecmascript-language-types-string-type

A cross-reference would be useful.

Norbert
