> On Jan 28, 2015, at 8:11, Allen Wirfs-Brock <[email protected]> wrote:
>
> > On Jan 28, 2015, at 2:36 AM, Marja Hölttä <[email protected]> wrote:
>
>> Hello es-discuss,
>>
>> TL;DR: /foo.bar/u.test("foo\uD83Dbar") == ?
>>
>> The ES6 unicode regexp spec is not very clear regarding what should happen
>> if the regexp or the matched string contains lonely surrogates (a lead
>> surrogate without a trail, or a trail without a lead). For example, for the
>> . operator, the relevant parts of the spec speak about characters:
>
> TL;DR: in a unicode regexp, each lonely surrogate is considered to be a
> single “character”.
>
> As André has already covered, “character” has a very specific meaning within
> the context of the ES6 RegExp specification, given in the second paragraph of
> http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics .
> The specification uses the same set of algorithms to describe both BMP (i.e.,
> 16-bit elements) and unicode (i.e., 32-bit elements) patterns and matching
> semantics. “Character” is used in those algorithms to refer to a single
> element of whichever mode is currently in effect.
>
> I think the ambiguity you find is in step 2.1 of
> http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern :
>
>   2. Return an internal closure that takes two arguments, a String str and an
>      integer index, and performs the following:
>     1. If Unicode is true, let Input be a List consisting of the sequence of
>        code points of str interpreted as a UTF-16 encoded Unicode string.
>        Otherwise, let Input be a List consisting of the sequence of code
>        units that are the elements of str. Input will be used throughout the
>        algorithms in 21.2.2. Each element of Input is considered to be a
>        character.
>
> Apparently I don’t have an adequate definition of “interpreted as a UTF-16
> encoded Unicode string”. If you submit a bug (at bugs.ecmascript.org) I will
> provide one in the next spec revision.
> The intended semantics is that:
>   In ascending string index order:
>     Each valid UTF-16 surrogate pair is interpreted as a single code point
>       that is the UTF-16 encoded value
>     Each “lonely” surrogate is interpreted as a single code point that is
>       the surrogate value
>     Every other 16-bit code unit is interpreted as a single code point.
That definition is in section 6.1.4:
http://people.mozilla.org/~jorendorff/es6-draft.html#sec-ecmascript-language-types-string-type

A cross-reference would be useful.

Norbert
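
For illustration, the three rules of the intended semantics quoted above can be sketched roughly as follows. This is only a sketch under my reading of those rules, not spec text; the helper name toCodePoints is hypothetical and was made up for this example.

```javascript
// Hypothetical helper (not from the spec): interpret a String as a list of
// code points, walking it in ascending string index order.
function toCodePoints(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const isLead = unit >= 0xD800 && unit <= 0xDBFF;
    if (isLead && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // Valid surrogate pair: one code point, the UTF-16 decoded value.
        out.push((unit - 0xD800) * 0x400 + (next - 0xDC00) + 0x10000);
        i++; // skip the trail surrogate that was just consumed
        continue;
      }
    }
    // Lonely surrogate or any other 16-bit code unit: one code point,
    // numerically equal to the code unit itself.
    out.push(unit);
  }
  return out;
}

// The string from the TL;DR: the lonely lead surrogate is a single element.
console.log(toCodePoints("foo\uD83Dbar").length); // 7
// Under the semantics described in this thread, "." matches that element.
console.log(/foo.bar/u.test("foo\uD83Dbar")); // true
```

With this reading, Input for "foo\uD83Dbar" has seven elements, the fourth being 0xD83D, so the answer to the TL;DR question would be true.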

