Re: Q: Lonely surrogates and unicode regexps

Allen Wirfs-Brock Wed, 28 Jan 2015 08:11:33 -0800

On Jan 28, 2015, at 2:36 AM, Marja Hölttä <ma...@chromium.org> wrote:


> Hello es-discuss,
> 
> TL;DR: /foo.bar/u.test("foo\uD83Dbar") == ?
> 
> The ES6 unicode regexp spec is not very clear regarding what should happen if 
> the regexp or the matched string contains lonely surrogates (a lead surrogate 
> without a trail, or a trail without a lead). For example, for the . operator, 
> the relevant parts of the spec speak about characters:

TL;DR: in a unicode regexp lonely surrogates are considered to be a single 
“character”. 

As André has already covered, “character” has a very specific meaning within the 
context of the ES6 RegExp specification; see the second paragraph of 
http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics . 
The specification uses the same set of algorithms to describe both BMP (i.e., 
16-bit elements) and Unicode (i.e., 32-bit elements) patterns and matching 
semantics. “Character” is used in those algorithms to refer to a single element 
of the input in whichever mode is currently operating.
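A minimal sketch of what “character” means in each mode, assuming an engine that implements the /u flag with the semantics described in this message. Apart from the TL;DR test, the sample strings and variable names are illustrative only:

```javascript
// U+1F600 is stored as the surrogate pair \uD83D\uDE00 (two 16-bit code units).
const pair = "\uD83D\uDE00";
// A lead surrogate with no trail surrogate after it.
const lonely = "foo\uD83Dbar";

// BMP (non-/u) mode: every code unit is a "character", so the pair is two of them.
const bmpDot = /^.$/.test(pair);       // false

// Unicode (/u) mode: the pair is decoded into one "character".
const unicodeDot = /^.$/u.test(pair);  // true

// In /u mode a lonely surrogate is also a single "character", so . matches it.
const lonelyDot = /foo.bar/u.test(lonely);  // true
```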

I think the ambiguity you find is in step 2.1 of 
http://people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern :

2. Return an internal closure that takes two arguments, a String str and an 
   integer index, and performs the following:
   1. If Unicode is true, let Input be a List consisting of the sequence of 
      code points of str interpreted as a UTF-16 encoded Unicode string. 
      Otherwise, let Input be a List consisting of the sequence of code units 
      that are the elements of str. Input will be used throughout the 
      algorithms in 21.2.2. Each element of Input is considered to be a 
      character.

Apparently I don’t have an adequate definition of “interpreted as a UTF-16 
encoded Unicode string”. If you submit a bug at bugs.ecmascript.org, I will 
provide one in the next spec revision. The intended semantics are:
   In ascending string index order:
        Each valid UTF-16 surrogate pair is interpreted as a single code point 
        that is the value encoded by the pair.
        Each “lonely” surrogate is interpreted as a single code point that is 
        the surrogate value.
        Every other 16-bit code unit is interpreted as a single code point.
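
The three rules above can be sketched as a decoder. This is only an illustration of the intended semantics, not spec text; the helper name toCodePoints is hypothetical:

```javascript
// Hypothetical helper (not from the spec): decode a String into the list of
// code points that a /u pattern would treat as Input.
function toCodePoints(str) {
  const input = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const isLead = unit >= 0xd800 && unit <= 0xdbff;
    if (isLead && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        // Valid surrogate pair: one code point, the value the pair encodes.
        input.push((unit - 0xd800) * 0x400 + (next - 0xdc00) + 0x10000);
        i++; // skip the trail surrogate that was just consumed
        continue;
      }
    }
    // Lonely surrogate or any other code unit: one code point, its own value.
    input.push(unit);
  }
  return input;
}

const a = toCodePoints("foo\uD83Dbar");  // 7 elements; the fourth is 0xD83D
const b = toCodePoints("\uD834\uDC00");  // [0x1D000]
const c = toCodePoints("\uDC00\uD83D");  // [0xDC00, 0xD83D]: two lonely surrogates
```

The built-in string iterator follows the same rule, so Array.from(str, ch => ch.codePointAt(0)) yields the same list.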

Allen

> 
> https://people.mozilla.org/~jorendorff/es6-draft.html#sec-atom
> https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-charactersetmatcher-abstract-operation
> https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-canonicalize-abstract-operation
> 
> E.g.,
> “Let A be the set of all *characters* except LineTerminator.”
> “Let ch be the *character* Input[e].”
> 
> But is a lonely surrogate a character? According to the Unicode standard, 
> it’s not. If it's not, what will ch be if the input string contains a lonely 
> surrogate in the relevant position?
> 
> Q1: Are lonely surrogates allowed in /u regexps?
> 
> E.g., /foo\uD83D/u; (note lonely lead surrogate), should this be allowed? 
> Will it match a lead surrogate inside a surrogate pair?
> 
> Suggestion: we shouldn't allow lonely surrogates in /u regexps.
> 
> If users actually want to match lonely surrogates (e.g., to check for them or 
> remove them) then they can use non-/u regexps.
> 
> The regexp syntax treats a lonely surrogate as a normal unicode escape, and 
> the rules say e.g., "The production RegExpUnicodeEscapeSequence :: u 
> Hex4Digits evaluates as follows: Return the character whose code is the SV of 
> Hex4Digits." - it's also unclear what this means if no valid character has 
> this code.
> 
> Q2: If the string contains a lonely surrogate, what should it match? Should 
> it match .? Should it match [^a] ? (Or is it undefined behavior?)
> 
> Test cases:
> /foo.bar/u.test("foo\uD83Dbar") == ?
> /foo.bar/u.test("foo\uDC00bar") == ?
> /foo[^a]bar/u.test("foo\uD83Dbar") == ?
> /foo[^a]bar/u.test("foo\uDC00bar") == ?
> /foo/u.test("bar\uD83Dbarfoo") == ?
> /foo/u.test("bar\uDC00barfoo") == ?
> /foo(.*)bar\1/u.test("foo\uD834bar\uD834\uDC00") == ? // Should the 
> backreference be allowed to match the lead surrogate of a surrogate pair?
> /^(.+)\1$/u.test("\uDC00foobar\uD83D\uDC00foobar\uD83D") == ?? // Should we 
> allow splitting the surrogate pair like this?
> 
> Suggestion: a lonely surrogate should not be a character and it should not 
> match . or [^a] etc. However, a lonely surrogate in the input string 
> shouldn't prevent some other part of the string from matching.
> 
> If a lonely surrogate is treated as a character, the matching rule for . gets 
> complicated and difficult / slow to implement: . should not match individual 
> surrogates inside a surrogate pair, but if it has to match a lonely 
> surrogate, we'll end up needing lookahead and lookbehind logic to implement 
> that behavior.
> 
> For example, the current version of Mathias’s ES6 Unicode regular expression 
> transpiler ( https://mothereff.in/regexpu ) converts /a.b/u into 
> /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/
>  and afaics it’s not yet fully consistent wrt lonely surrogates, so, a 
> consistent implementation is going to be more complex than this.
> 
> If we convert the string into UTF-32 before matching, then the "lonely 
> surrogate is a character" behavior gets easier to implement, but we wouldn't 
> want to be forced to do that. The intention behind the ES6 spec seems to be 
> that strings can / should still be stored as UTF-16. Converting strings to 
> UTF-32 before matching with /u regexps would require an additional pass over 
> the string which we'd want to avoid, and converting only when strictly needed 
> for the "lonely surrogate is a character" implementation adds complexity. 
> E.g., with some regexps we don't need to scan the whole input string to find 
> a match, and also most input strings, even for /u regexps, probably won't 
> contain surrogates (to find that out we'd also need to scan the whole string, 
> or some logic to fall back to UTF-32 matching when we see a surrogate).
> 
> BR,
> Marja
> 
> _______________________________________________
> es-discuss mailing list
> es-discuss@mozilla.org
> https://mail.mozilla.org/listinfo/es-discuss

