> On 28 Jan 2015, at 11:36, Marja Hölttä <ma...@chromium.org> wrote:
> 
> TL;DR: /foo.bar/u.test("foo\uD83Dbar") == ?
> 
> The ES6 unicode regexp spec is not very clear regarding what should happen if 
> the regexp or the matched string contains lonely surrogates (a lead surrogate 
> without a trail, or a trail without a lead). For example, for the . operator, 
> the relevant parts of the spec speak about characters:
> 
> https://people.mozilla.org/~jorendorff/es6-draft.html#sec-atom
> https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-charactersetmatcher-abstract-operation
> https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-canonicalize-abstract-operation
> 
> E.g.,
> “Let A be the set of all *characters* except LineTerminator.”
> “Let ch be the *character* Input[e].”
> 
> But is a lonely surrogate a character? According to the Unicode standard, 
> it’s not. If it's not, what will ch be if the input string contains a lonely 
> surrogate in the relevant position?
> 
> Q1: Are lonely surrogates allowed in /u regexps?
> 
> E.g., /foo\uD83D/u; (note lonely lead surrogate), should this be allowed? 
> Will it match a lead surrogate inside a surrogate pair?
> 
> Suggestion: we shouldn't allow lonely surrogates in /u regexps.
> 
> If users actually want to match lonely surrogates (e.g., to check for them or 
> remove them) then they can use non-/u regexps.

You’re proposing to define “characters” in terms of Unicode scalar values 
when `/u` is used. I could get behind that: it reinforces the idea that `/u` 
is like a strict mode for regular expressions.

Playing devil’s advocate, the problem is that regular expressions and strings 
go hand in hand, and there is no guarantee that JavaScript strings consist 
only of Unicode scalar values; they can contain lone surrogates. Making `.` 
not match lone surrogates breaks the developer expectation that `(.)` matches 
every “part” of the string. Having to avoid `/u` to preserve that expectation 
seems like a potentially bad thing.
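
As a rough sketch of that expectation: string iteration already hands back a 
lone surrogate as its own “part”, so a `.` that refused to match it would 
disagree with the iterator. The sample strings below are made up for 
illustration, and the snippet says nothing about what `/u` itself should do.

```javascript
// Hypothetical sample: "foo", a lone lead surrogate, then "bar".
const sample = "foo\uD83Dbar";

// String iteration walks the string by code point; a lone surrogate
// is yielded as a single one-unit element rather than being skipped.
const parts = Array.from(sample);
console.log(parts.length);                          // 7
console.log(parts[3].charCodeAt(0).toString(16));   // "d83d"

// A well-formed surrogate pair is yielded as one two-unit element.
const paired = Array.from("foo\uD83D\uDC00bar");
console.log(paired.length, paired[3].length);       // 7 2
```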

> The regexp syntax treats a lonely surrogate as a normal unicode escape, and 
> the rules say e.g., "The production RegExpUnicodeEscapeSequence :: u 
> Hex4Digits evaluates as follows: Return the character whose code is the SV of 
> Hex4Digits." - it's also unclear what this means if no valid character has 
> this code.
> 
> Q2: If the string contains a lonely surrogate, what should it match? Should 
> it match .? Should it match [^a] ? (Or is it undefined behavior?)
> 
> Test cases:
> /foo.bar/u.test("foo\uD83Dbar") == ?
> /foo.bar/u.test("foo\uDC00bar") == ?
> /foo[^a]bar/u.test("foo\uD83Dbar") == ?
> /foo[^a]bar/u.test("foo\uDC00bar") == ?
> /foo/u.test("bar\uD83Dbarfoo") == ?
> /foo/u.test("bar\uDC00barfoo") == ?
> /foo(.*)bar\1/u.test("foo\uD834bar\uD834\uDC00") == ? // Should the 
> // backreference be allowed to match the lead surrogate of a surrogate pair?
> /^(.+)\1$/u.test("\uDC00foobar\uD83D\uDC00foobar\uD83D") == ? // Should we 
> // allow splitting the surrogate pair like this?
> 
> Suggestion: a lonely surrogate should not be a character and it should not 
> match . or [^a] etc. However, a lonely surrogate in the input string 
> shouldn't prevent some other part of the string from matching.
> 
> If a lonely surrogate is treated as a character, the matching rule for . gets 
> complicated and difficult / slow to implement: . should not match individual 
> surrogates inside a surrogate pair, but if it has to match a lonely 
> surrogate, we'll end up needing lookahead and lookbehind logic to implement 
> that behavior.
> 
> For example, the current version of Mathias’s ES6 Unicode regular expression 
> transpiler ( https://mothereff.in/regexpu ) converts /a.b/u into 
> /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/
>  and afaics it’s not yet fully consistent wrt lonely surrogates, so, a 
> consistent implementation is going to be more complex than this.

This is indeed an incomplete solution. The lack of lookbehind support in ES 
makes this hard to transpile correctly. Ideas welcome!
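
To make the gap concrete, here is a small sketch that runs the ES5 pattern 
quoted above against a few inputs. The inputs are made up for illustration 
and are not taken from any test suite. The comments on the last two cases 
assume a reading in which a lone surrogate counts as one character, which is 
exactly the point under discussion.

```javascript
// The ES5 pattern quoted above as the transpiled form of /a.b/u.
const transpiled = /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/;

// Hypothetical inputs, chosen to exercise each alternative.
const results = {
  // Full surrogate pair between a and b: matched by the pair alternative.
  astralPair: transpiled.test("a\uD83D\uDC00b"),
  // Lone lead: handled by the negative lookahead alternative.
  loneLead: transpiled.test("a\uD83Db"),
  // Lone trail right after `a`: the last alternative needs to consume a
  // preceding non-lead character, but `a` has already been consumed.
  loneTrail: transpiled.test("a\uDC00b"),
  // Two lone trails: the first one is consumed as the "preceding character",
  // so two units are accepted where `.` should accept one character.
  twoLoneTrails: transpiled.test("a\uDC00\uDC00b"),
};

console.log(results);
// { astralPair: true, loneLead: true, loneTrail: false, twoLoneTrails: true }
```

With lookbehind, the last alternative could check the preceding code unit 
without consuming it; without it, the pattern has to swallow that unit, which 
is where both of the odd results come from.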

_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
