On Wed, Jan 28, 2015 at 11:45 AM, Mathias Bynens <math...@qiwi.be> wrote:
>
> > On 28 Jan 2015, at 11:36, Marja Hölttä <ma...@chromium.org> wrote:
> >
> > For example, the current version of Mathias’s ES6 Unicode regular
> > expression transpiler ( https://mothereff.in/regexpu ) converts /a.b/u
> > into
> > /a(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b/
> > and afaics it’s not yet fully consistent wrt lonely surrogates, so, a
> > consistent implementation is going to be more complex than this.
>
> This is indeed an incomplete solution. The lack of lookbehind support in
> ES makes this hard to transpile correctly. Ideas welcome!
>

I don't think your transpiler can work without lookbehind.

If you could guarantee that none of your transpiled regexps matches a substring that ends in the middle of a pair, then I think you could get it right without lookbehind, but consider:

    /(...)-\1./.test("TxL-TxLT");

where L stands for a lead surrogate and T stands for a trailing surrogate. There's no way to stop the backreference from swallowing the last L, and without lookbehind there is no way to stop the . from matching the final T.

A second issue is having a match that starts in the middle of a pair. You could test for this after the matching if JS gave you the index of the match in the string, but I don't think it does.

Ignoring the start-of-match-in-the-middle-of-a-pair issue and the backreferences case, I think you can do without lookbehind. Assuming the lonely-surrogates-are-a-character scenario, the period (.) transpiles to (ignore spaces added for readability):

    (?: \L(?!\T) | \L\T | \T | [^\L\T\N] )

where \L means leading surrogates, \T means trailing surrogates, and \N means all newlines. Whatever comes before the . is not allowed to match half of a pair.

As an optimization, .x can transpile to

    (?: \L\T | . )x

where the x stands in for any literal characters.
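To make the expansion above concrete, here is a minimal sketch in plain JavaScript (no u flag, no lookbehind). The names L, T, N, DOT, DOT_BEFORE_LITERAL and matchesWhole are made up for illustration, and \N is assumed to mean the four ES line terminators. It is not what regexpu actually emits; it only shows the shape of the expansion and the backreference case where it goes wrong.

```javascript
// Character-class fragments standing in for the \L, \T and \N shorthands.
const L = "\\uD800-\\uDBFF";       // lead surrogates
const T = "\\uDC00-\\uDFFF";       // trailing surrogates
const N = "\\n\\r\\u2028\\u2029";  // assumed meaning of \N: ES line terminators

// "." with lonely surrogates counted as characters, no lookbehind:
//   (?: \L(?!\T) | \L\T | \T | [^\L\T\N] )
const DOT = `(?:[${L}](?![${T}])|[${L}][${T}]|[${T}]|[^${L}${T}${N}])`;

// The ".x" optimization: (?: \L\T | . ) followed by the literal.
const DOT_BEFORE_LITERAL = `(?:[${L}][${T}]|.)`;

// Hypothetical helper: the pattern must consume the whole input.
function matchesWhole(pattern, input) {
  return new RegExp(`^(?:${pattern})$`).test(input);
}

const lead = "\uD83D";   // an L
const trail = "\uDCA9";  // a T

const single = {
  pair: matchesWhole(DOT, lead + trail),       // a full pair is one character
  lonelyLead: matchesWhole(DOT, lead),         // a lonely L is one character
  lonelyTrail: matchesWhole(DOT, trail),       // a lonely T is one character
  newline: matchesWhole(DOT, "\n"),            // newlines are still excluded
  twoChars: matchesWhole(DOT, "ab"),           // two characters are not one
  pairThenX: matchesWhole(DOT_BEFORE_LITERAL + "x", lead + trail + "x"),
};

// The problem case: /(...)-\1./ against T x L - T x L T.
// The backreference swallows the last L and the final "." takes the T.
const tricky = trail + "x" + lead + "-" + trail + "x" + lead + trail;
const trickyMatches =
  new RegExp(`(${DOT}${DOT}${DOT})-\\1${DOT}`).test(tricky);

console.log(single, trickyMatches);
```

The last test reports a match even though, counted in code points, the subject has only seven characters and the pattern needs eight. That is the half-pair problem described above.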

For a JS engine implementor, like Marja, it is of course possible to add 1-character negative lookbehind (\b already has elements of this). Then your in-engine transpiler turns . into

    (?: \L(?!\T) | \L\T | (?<!\L)\T | [^\L\T\N] )

which is going to be truly horrible in terms of code size and performance. It's not like the period operator is a rare thing in a regexp, and other common things like [^a-z] and [^\d] will expand into similar horrors.

On the other hand, in the lonely-surrogates-match-nothing scenario, the . transpiles to

    (?: \L\T | [^\L\T\N] )

which is quite a lot nicer and faster. In this scenario, .x expands to

    (?: \L\T | [^\L\T\N] )x

which still has no lookaheads or lookbehinds.
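
A rough sketch of the two variants, again in plain JavaScript with made-up names (DOT_LOOKBEHIND, DOT_STRICT, backrefPattern) and \N assumed to be the four ES line terminators. The lookbehind variant only runs on an engine that accepts (?<!...) syntax, which is an assumption here. Comparing pattern source lengths is just a crude stand-in for the code size concern, not a measurement of anything an engine would generate.

```javascript
// Character-class fragments standing in for the \L, \T and \N shorthands.
const L = "\\uD800-\\uDBFF";       // lead surrogates
const T = "\\uDC00-\\uDFFF";       // trailing surrogates
const N = "\\n\\r\\u2028\\u2029";  // assumed meaning of \N: ES line terminators

// Lonely surrogates are characters, with 1-character negative lookbehind:
//   (?: \L(?!\T) | \L\T | (?<!\L)\T | [^\L\T\N] )
// Assumes the engine supports lookbehind assertions.
const DOT_LOOKBEHIND =
  `(?:[${L}](?![${T}])|[${L}][${T}]|(?<![${L}])[${T}]|[^${L}${T}${N}])`;

// Lonely surrogates match nothing:
//   (?: \L\T | [^\L\T\N] )
const DOT_STRICT = `(?:[${L}][${T}]|[^${L}${T}${N}])`;

// Hypothetical helper: builds /(...)-\1./ with each "." replaced by `dot`.
function backrefPattern(dot) {
  return new RegExp(`(${dot}${dot}${dot})-\\1${dot}`);
}

const lead = "\uD83D";   // an L
const trail = "\uDCA9";  // a T
const tricky = trail + "x" + lead + "-" + trail + "x" + lead + trail;

const results = {
  // The final T sits right after an L, so (?<!\L)\T refuses it.
  lookbehindOnTricky: backrefPattern(DOT_LOOKBEHIND).test(tricky),
  // No "." can match a lonely surrogate, so the group never gets going.
  strictOnTricky: backrefPattern(DOT_STRICT).test(tricky),
  strictOnPair: new RegExp(`^${DOT_STRICT}$`).test(lead + trail),
  strictOnLonelyLead: new RegExp(`^${DOT_STRICT}$`).test(lead),
  strictOnLonelyTrail: new RegExp(`^${DOT_STRICT}$`).test(trail),
};

// Source length of each expansion of a single ".".
const sizes = {
  lookbehind: DOT_LOOKBEHIND.length,
  strict: DOT_STRICT.length,
};

console.log(results, sizes);
```

Both variants reject the "TxL-TxLT" case, but the strict one gets there with two alternatives and no assertions, where the lookbehind one needs four alternatives and two assertions for every single period.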

_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss