Re: Q: Lonely surrogates and unicode regexps

Wes Garland Wed, 28 Jan 2015 04:56:00 -0800

Some interesting questions here.

1 - What is a character? Is it a Unicode Code Point?
2 - Should we be able to match all possible JS Strings?
3 - Should we be able to match all possible Unicode Strings?
4 - What do we do if there is a character in a String we cannot match?
5 - Do unmatchable characters match . ?
6 - Are subsections of unmatchable strings matchable if they contain only
matchable characters?


It is important to remember in these discussions that the Unicode
specification allows strings which contain unpaired surrogates. Do we want
regular expressions that can't match some Unicode strings? Do we extend the
regexp syntax to have a symbol which matches an unpaired surrogate? How
about reserved code points? What happens when they become assigned?
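
To make questions 1 to 3 a bit more concrete, here is a rough sketch of
decoding a JS string into code points while flagging unpaired surrogates.
It assumes the "an unpaired surrogate is its own element" reading, which is
only one of the options under discussion. The names decodeCodePoints and
lone are made up for illustration and do not come from any spec text.

```javascript
// Hypothetical helper: walk the UTF-16 code units of a string, combine
// well-formed surrogate pairs, and keep unpaired surrogates as their own
// element with lone = true.
function decodeCodePoints(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const isLead = unit >= 0xD800 && unit <= 0xDBFF;
    const isTrail = unit >= 0xDC00 && unit <= 0xDFFF;
    if (isLead && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // Well-formed pair: combine into one supplementary code point.
        const codePoint = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
        out.push({ codePoint, lone: false });
        i++; // skip the trail surrogate we just consumed
        continue;
      }
    }
    // Either an ordinary BMP code unit or an unpaired surrogate.
    out.push({ codePoint: unit, lone: isLead || isTrail });
  }
  return out;
}
```

Under this reading, "foo\uD83Dbar" decodes to seven elements, the fourth of
which is flagged as a lone surrogate. Whether a regexp should be able to
match that element is exactly the open question.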


On 28 January 2015 at 05:36, Marja Hölttä <[email protected]> wrote:

> Hello es-discuss,
>
> TL;DR: /foo.bar/u.test("foo\uD83Dbar") == ?
>
> The ES6 unicode regexp spec is not very clear regarding what should happen
> if the regexp or the matched string contains lonely surrogates (a lead
> surrogate without a trail, or a trail without a lead). For example, for the
> . operator, the relevant parts of the spec speak about characters:
>
> https://people.mozilla.org/~jorendorff/es6-draft.html#sec-atom
>
> https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-charactersetmatcher-abstract-operation
>
> https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-canonicalize-abstract-operation
>
> E.g.,
> “Let A be the set of all *characters* except LineTerminator.”
> “Let ch be the *character* Input[e].”
>
> But is a lonely surrogate a character? According to the Unicode standard,
> it’s not. If it's not, what will ch be if the input string contains a
> lonely surrogate in the relevant position?
>
> Q1: Are lonely surrogates allowed in /u regexps?
>
> E.g., /foo\uD83D/u; (note lonely lead surrogate), should this be allowed?
> Will it match a lead surrogate inside a surrogate pair?
>
> Suggestion: we shouldn't allow lonely surrogates in /u regexps.
>
> If users actually want to match lonely surrogates (e.g., to check for them
> or remove them) then they can use non-/u regexps.
>
> The regexp syntax treats a lonely surrogate as a normal unicode escape,
> and the rules say e.g., "The production RegExpUnicodeEscapeSequence :: u
> Hex4Digits evaluates as follows: Return the character whose code is the SV
> of Hex4Digits." - it's also unclear what this means if no valid character
> has this code.
>
> Q2: If the string contains a lonely surrogate, what should it match?
> Should it match .? Should it match [^a] ? (Or is it undefined behavior?)
>
> Test cases:
> /foo.bar/u.test("foo\uD83Dbar") == ?
> /foo.bar/u.test("foo\uDC00bar") == ?
> /foo[^a]bar/u.test("foo\uD83Dbar") == ?
> /foo[^a]bar/u.test("foo\uDC00bar") == ?
> /foo/u.test("bar\uD83Dbarfoo") == ?
> /foo/u.test("bar\uDC00barfoo") == ?
> // Should the backreference be allowed to match the lead surrogate of a
> // surrogate pair?
> /foo(.*)bar\1/u.test("foo\uD834bar\uD834\uDC00") == ?
> // Should we allow splitting the surrogate pair like this?
> /^(.+)\1$/u.test("\uDC00foobar\uD83D\uDC00foobar\uD83D") == ?
>
> Suggestion: a lonely surrogate should not be a character and it should not
> match . or [^a] etc. However, a lonely surrogate in the input string
> shouldn't prevent some other part of the string from matching.
>
> If a lonely surrogate is treated as a character, the matching rule for .
> gets complicated and difficult / slow to implement: . should not match
> individual surrogates inside a surrogate pair, but if it has to match a
> lonely surrogate, we'll end up needing lookahead and lookbehind logic to
> implement that behavior.
>
> For example, the current version of Mathias’s ES6 Unicode regular
> expression transpiler ( https://mothereff.in/regexpu ) converts /a.b/u
> into
> /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/
> and afaics it’s not yet fully consistent wrt lonely surrogates, so, a
> consistent implementation is going to be more complex than this.
>
> If we convert the string into UTF-32 before matching, then the "lonely
> surrogate is a character" behavior gets easier to implement, but we
> wouldn't want to be forced to do that. The intention behind the ES6 spec
> seems to be that strings can / should still be stored as UTF-16. Converting
> strings to UTF-32 before matching with /u regexps would require an
> additional pass over the string which we'd want to avoid, and converting
> only when strictly needed for the "lonely surrogate is a character"
> implementation adds complexity. E.g., with some regexps we don't need to
> scan the whole input string to find a match, and also most input strings,
> even for /u regexps, probably won't contain surrogates (to find that out
> we'd also need to scan the whole string, or some logic to fall back to
> UTF-32 matching when we see a surrogate).
>
> BR,
> Marja
>
>
> _______________________________________________
> es-discuss mailing list
> [email protected]
> https://mail.mozilla.org/listinfo/es-discuss
>
>

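For anyone who wants to try the quoted test cases against an engine, a small
harness along these lines might help. It runs each pattern with and without
the u flag and records both results. The names testCases and runCases are
hypothetical, and the sketch assumes an engine that accepts the u flag at
all. It makes no claim about what the /u results ought to be.

```javascript
// The eight test cases from the quoted message, as [pattern source, input].
const testCases = [
  ["foo.bar", "foo\uD83Dbar"],
  ["foo.bar", "foo\uDC00bar"],
  ["foo[^a]bar", "foo\uD83Dbar"],
  ["foo[^a]bar", "foo\uDC00bar"],
  ["foo", "bar\uD83Dbarfoo"],
  ["foo", "bar\uDC00barfoo"],
  ["foo(.*)bar\\1", "foo\uD834bar\uD834\uDC00"],
  ["^(.+)\\1$", "\uDC00foobar\uD83D\uDC00foobar\uD83D"],
];

// Run every case twice: once as a code-unit regexp, once as a /u regexp.
function runCases(cases) {
  return cases.map(([source, input]) => ({
    source,
    withoutU: new RegExp(source).test(input),
    withU: new RegExp(source, "u").test(input),
  }));
}

const results = runCases(testCases);
for (const r of results) {
  console.log(`/${r.source}/  non-u: ${r.withoutU}  u: ${r.withU}`);
}
```

Without the u flag, matching works on individual code units, so all eight
cases match. The interesting column is the /u one.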

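On the suggestion that people who want to find or remove lonely surrogates
can use non-/u regexps: a minimal sketch of that approach, assuming a plain
code-unit regexp and a replacement callback. The function name
replaceLoneSurrogates and the default of U+FFFD are illustrative choices,
not anything proposed in the thread.

```javascript
// Hypothetical helper using a non-/u regexp. The first alternative consumes
// well-formed pairs so they are left alone; the second alternative then only
// ever sees surrogates that have no partner.
function replaceLoneSurrogates(str, replacement = "\uFFFD") {
  const pairOrLone = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]/g;
  return str.replace(pairOrLone, (match) =>
    match.length === 2 ? match : replacement
  );
}

// Checking for lone surrogates falls out of the same idea.
function hasLoneSurrogate(str) {
  return replaceLoneSurrogates(str, "") !== str;
}
```

Passing an empty string as the replacement removes the lone surrogates
instead of substituting for them.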
-- 
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102
