Re: RegExp lookbehind

Steven L. Sun, 18 Mar 2012 11:04:50 -0700

Erik Corry wrote:

Steven Levithan wrote:

In practice, at least for [Java-style finite-length] lookbehind, this
attempt to avoid far-back searches is kind of silly--e.g., Java lets youuse
a quantifier like {0,100000} within lookbehind.


At least this provides something to point the user at and say "look,
this is why you have bad performance".


Fair point.

Erik Corry wrote:

Really?  I don't have a .NET implementation handy, but I'll make a
prediction based on the description of its algorithm.  Lets look at
the following.  (replace the '+'s with {1,1000} for the sake of the
non-.NET regexps, I'm not going to type that ugliness here):

/(x(.+?))$/
/(?<(x(.+?))$/

Give these regexps the input: "x foo x bar".  The first should match
"x foo x bar" and the second will match "x bar" as the non-greedy
quantifier will stop when it finds the nearest x from the end.  But in
a regexp engine with the "start at the earliest point and search
forwards" kludge they will return the same.

You're right for .NET. But I must add: I wasn't really considering thenongreedy case (which should have been obvious!) in any of my previousdescriptions of lookbehind behavior or developer expectations for it. Boo,me! :-( I've just done some testing that also shows I've beenwrong/oversimplifying/overgeneralizing in my descriptions of lookbehindbehavior and implementation across multiple regex flavors (even in thegreedy case).

Let me give some real results. I'll leave out comparisons with/(nonlookbehind)$/ since I think everyone interested in this discussionknows what those results should be.


/(?<=(\d+))$/ with "123":
   * .NET: $1 == "123".
   * Java: [Unsupported]
   * PCRE: [Unsupported]

/(?<=(\d{1,3}))$/ with "123":
   * .NET: $1 == "123".
   * Java: $1 == "3".
   * PCRE: [Unsupported]

/(?<=(\d{1,3}?))$/ with "123":
   * .NET: $1 == "3".
   * Java: $1 == "3".
   * PCRE: [Unsupported]

/(?<=(\d{1,3})(\d{1,3}?))$/ with "123":
   * .NET: $1 == "12", $2 == "3".
   * Java: $1 == "2", $2 == "3".
   * PCRE: [Unsupported]

/(?<=(\d{1,3}?)(\d{1,3}))$/ with "123":
   * .NET: $1 == "1", $2 == "23".
   * Java: $1 == "2", $2 == "3".
   * PCRE: [Unsupported]

/(?<=(\d)|(\d\d)|(\d\d\d))$/ with "123":
   * .NET: $1 == "3". [$2 and $3 don't participate.]
   * Java: $1 == "3". [$2 and $3 don't participate.]
   * PCRE: $1 == "3". [$2 and $3 don't participate.]

/(?<=(\d\d\d)|(\d\d)|(\d))$/ with "123":
   * .NET: $1 == "123". [$2 and $3 don't participate.]
   * Java: $3 == "3". [$1 and $2 don't participate.]
   * PCRE: $1 == "123". [$2 and $3 don't participate.]

Notes:

* There is more to test (e.g., nested lookbehind, supported by .NET, Java,and PCRE), but this should demonstrate the basic semantics used by the threemajor implementations that support capturing within variable-lengthlookbehind.

* Oniguruma (used by default in Ruby 1.9) supports /(?<=a|bc)/ but not/(?<=ab?)/ or /(?<=a(?:b|cd))/. It also doesn't support capturing groups inlookbehind.

* Perl, Python, and Boost.Regex support fixed-length lookbehind only:(?<=abc) or (?<=a|b).

* Ruby 1.8, Tcl, XML Schema/XPath, RE2, and ERE/BRE do not supportlookbehind at all.

* I haven't tested ICU Regular Expressions, but their docs say they supportfinite-length lookbehind (without specifics).

IMO, .NET's lookbehind is the flagship/ideal, both in its support for allregex syntax and in it's greedy/nongreedy behavior.

This is probably as good a place as any to note that, should the committeefeel compelled to reconsider (or augment) lookbehind support, there isanother simpler and more efficient idea that works similar to lookbehind atthe start of a pattern: \K ("keep"). \K was invented by Jeff Pinyan (see[1]) and made it's way into Perl 5.10, PCRE 7.2, and Boost.Regex (ver?).

\K effectively resets the match start wherever it is encountered, causingany previously matched characters not to be included in the final match(e.g., /foo\Kbar/ matches "foobar", but reports that it matched "bar"). Thisavoids lookbehind-esque concerns about infinite/finite/fixed length. Alsonote that \K doesn't interfere with the setting of backreferences (e.g.,/(foo)\Kbar/ sets $1 to "foo"), and can be used within alternatives (e.g.,/foo|bar\Kbaz/ matches "foo" or "barbaz", but reports matching "foo" or"baz") or repeated groups, etc.


\K has a couple issues of its own, though:

* Perl docs say \K's behavior in lookaround is "not well defined". InPCRE, \K is acted upon within positive lookaround but ignored in negativelookaround.

* \K in ES would open the possibility of zero-length matches that changethe value of RegExp.prototype.lastIndex. This isn't a problem on its own,but it can effect how accompanying code must be written. It could also breakcode unwise to \K that accepts arbitrary regexes from an outside source.


-- Steven Levithan

[1]: http://search.cpan.org/~pinyan/Regexp-Keep-0.02/Keep.pm


_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss

Re: RegExp lookbehind

Reply via email to