Erik Corry wrote:
Steven Levithan wrote:
In practice, at least for [Java-style finite-length] lookbehind, this
attempt to avoid far-back searches is kind of silly--e.g., Java lets you use
a quantifier like {0,100000} within lookbehind.

At least this provides something to point the user at and say "look,
this is why you have bad performance".

Fair point.

Erik Corry wrote:
Really?  I don't have a .NET implementation handy, but I'll make a
prediction based on the description of its algorithm.  Lets look at
the following.  (replace the '+'s with {1,1000} for the sake of the
non-.NET regexps, I'm not going to type that ugliness here):

/(x(.+?))$/
/(?<(x(.+?))$/

Give these regexps the input: "x foo x bar".  The first should match
"x foo x bar" and the second will match "x bar" as the non-greedy
quantifier will stop when it finds the nearest x from the end.  But in
a regexp engine with the "start at the earliest point and search
forwards" kludge they will return the same.

You're right for .NET. But I must add: I wasn't really considering the nongreedy case (which should have been obvious!) in any of my previous descriptions of lookbehind behavior or developer expectations for it. Boo, me! :-( I've just done some testing that also shows I've been wrong/oversimplifying/overgeneralizing in my descriptions of lookbehind behavior and implementation across multiple regex flavors (even in the greedy case).

Let me give some real results. I'll leave out comparisons with /(nonlookbehind)$/ since I think everyone interested in this discussion knows what those results should be.

/(?<=(\d+))$/ with "123":
   * .NET: $1 == "123".
   * Java: [Unsupported]
   * PCRE: [Unsupported]

/(?<=(\d{1,3}))$/ with "123":
   * .NET: $1 == "123".
   * Java: $1 == "3".
   * PCRE: [Unsupported]

/(?<=(\d{1,3}?))$/ with "123":
   * .NET: $1 == "3".
   * Java: $1 == "3".
   * PCRE: [Unsupported]

/(?<=(\d{1,3})(\d{1,3}?))$/ with "123":
   * .NET: $1 == "12", $2 == "3".
   * Java: $1 == "2", $2 == "3".
   * PCRE: [Unsupported]

/(?<=(\d{1,3}?)(\d{1,3}))$/ with "123":
   * .NET: $1 == "1", $2 == "23".
   * Java: $1 == "2", $2 == "3".
   * PCRE: [Unsupported]

/(?<=(\d)|(\d\d)|(\d\d\d))$/ with "123":
   * .NET: $1 == "3". [$2 and $3 don't participate.]
   * Java: $1 == "3". [$2 and $3 don't participate.]
   * PCRE: $1 == "3". [$2 and $3 don't participate.]

/(?<=(\d\d\d)|(\d\d)|(\d))$/ with "123":
   * .NET: $1 == "123". [$2 and $3 don't participate.]
   * Java: $3 == "3". [$1 and $2 don't participate.]
   * PCRE: $1 == "123". [$2 and $3 don't participate.]

Notes:

* There is more to test (e.g., nested lookbehind, supported by .NET, Java, and PCRE), but this should demonstrate the basic semantics used by the three major implementations that support capturing within variable-length lookbehind.

* Oniguruma (used by default in Ruby 1.9) supports /(?<=a|bc)/ but not /(?<=ab?)/ or /(?<=a(?:b|cd))/. It also doesn't support capturing groups in lookbehind.

* Perl, Python, and Boost.Regex support fixed-length lookbehind only: (?<=abc) or (?<=a|b).

* Ruby 1.8, Tcl, XML Schema/XPath, RE2, and ERE/BRE do not support lookbehind at all.

* I haven't tested ICU Regular Expressions, but their docs say they support finite-length lookbehind (without specifics).

IMO, .NET's lookbehind is the flagship/ideal, both in its support for all regex syntax and in it's greedy/nongreedy behavior.

This is probably as good a place as any to note that, should the committee feel compelled to reconsider (or augment) lookbehind support, there is another simpler and more efficient idea that works similar to lookbehind at the start of a pattern: \K ("keep"). \K was invented by Jeff Pinyan (see [1]) and made it's way into Perl 5.10, PCRE 7.2, and Boost.Regex (ver?).

\K effectively resets the match start wherever it is encountered, causing any previously matched characters not to be included in the final match (e.g., /foo\Kbar/ matches "foobar", but reports that it matched "bar"). This avoids lookbehind-esque concerns about infinite/finite/fixed length. Also note that \K doesn't interfere with the setting of backreferences (e.g., /(foo)\Kbar/ sets $1 to "foo"), and can be used within alternatives (e.g., /foo|bar\Kbaz/ matches "foo" or "barbaz", but reports matching "foo" or "baz") or repeated groups, etc.

\K has a couple issues of its own, though:

* Perl docs say \K's behavior in lookaround is "not well defined". In PCRE, \K is acted upon within positive lookaround but ignored in negative lookaround.

* \K in ES would open the possibility of zero-length matches that change the value of RegExp.prototype.lastIndex. This isn't a problem on its own, but it can effect how accompanying code must be written. It could also break code unwise to \K that accepts arbitrary regexes from an outside source.

-- Steven Levithan

[1]: http://search.cpan.org/~pinyan/Regexp-Keep-0.02/Keep.pm


_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss

Reply via email to