Re: RegExp lookbehind

Steven L. Sat, 17 Mar 2012 17:58:40 -0700

Lasse Reichstein wrote:

I would simply apply same logic we have already for the look ahead ... or
you think that would cause problems?


I'm not sure it even makes sense.

ES RegExps are backtracking based, and it makes a difference in which
order alternatives are tried. Greedy matching is defined in terms of
number of repetitions, not length of the match. All of these are
defined in a way that assumes left-to-right matching.

Example:
 Take the RegExp  /(?<((?:aa|aaa)+))b/  where (?< ... ) delimits the
look-behind.
 and try matching it on the string "xaaaaaaaaab".
 Then tell me how many a's are captured by the capturing group, and why :)

The most "intuitive" interpretation would be a reverse implementation
of the normal matching algorithm, i.e., "backwards matching", but that
would likely duplicate the entire RegExp semantics (or parameterize it
by a direction).

Any attempt to use the normal (forward) semantics and then try to find
an earlier point to start it at is likely to be either flawed or
effectively unpredictable to users.

Technically, you're right. They're different. But they can appear exactly the same by implementing lookbehind as a zero-length assertion of (?:lookbehind)$ matched against the lookbehind's left context, starting from the very start of the subject string. Although people thinking about implementation might come to think of some other approach as more intuitive, from my experience every single plain-old-developer unconcerned about implementation thinks of the semantics I just described as intuitive. It is also how every single implementation of lookbehind that I know of actually works.
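To make that concrete, here is a minimal sketch of the emulation, applied to the example pattern from earlier in the thread. The helper name lookbehindAt and its signature are made up for illustration; this shows the semantics only and is not how any particular engine implements lookbehind.

```javascript
// Hypothetical helper (not part of any engine or library): emulate a
// lookbehind at index `pos` by matching (?:source)$ against the left
// context, i.e. everything in `subject` before `pos`.
function lookbehindAt(source, subject, pos) {
  // Without the /m flag, $ only matches at the very end of the input,
  // which here is the position where the lookbehind is being tested.
  const re = new RegExp("(?:" + source + ")$");
  // exec searches forward from the start of the left context.
  return re.exec(subject.slice(0, pos)); // match array, or null
}

const subject = "xaaaaaaaaab";
const pos = subject.indexOf("b");

// The lookbehind body from the example, capturing group included.
const m = lookbehindAt("((?:aa|aaa)+)", subject, pos);

// Under these forward semantics the group captures the whole run of a's,
// because the search starts at the left and backtracks until $ is reached.
console.log(m !== null && m[1] === subject.slice(1, pos));
```

Since the assertion is just an ordinary forward match anchored with $, the capture is by construction whatever /(re)$/ would capture on the same left context.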

The reason that all major regex flavors except .NET don't support infinite-length lookbehind is that it's inefficient to re-search from the very beginning of an arbitrarily long string. That's why they support fixed- or finite-length lookbehind only--if they can determine the maximum distance backward they need to search forward from, they can step back only that many characters. In practice, at least for finite- rather than fixed-length lookbehind, this attempt to avoid far-back searches is kind of silly--e.g., Java lets you use a quantifier like {0,100000} within lookbehind.
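A rough sketch of the "step back only that many characters" idea follows. The helper boundedLookbehindAt and the caller-supplied maxLen are assumptions for illustration; a real engine would derive the bound from the pattern itself and would keep the full string visible, whereas slicing as done here changes what ^ or \b inside the lookbehind can see.

```javascript
// Hypothetical helper: test a lookbehind at `pos`, searching forward from
// at most `maxLen` characters back instead of from the start of `subject`.
// `maxLen` is assumed to be a correct upper bound on the lookbehind's length.
function boundedLookbehindAt(source, subject, pos, maxLen) {
  const start = Math.max(0, pos - maxLen);
  const re = new RegExp("(?:" + source + ")$");
  // Only the last `maxLen` characters of the left context are examined.
  return re.test(subject.slice(start, pos));
}

// A long subject where rescanning from index 0 would be wasteful.
const longSubject = "x".repeat(100000) + "ab";
const bPos = longSubject.length - 1; // index of the trailing "b"

const precededByA = boundedLookbehindAt("a", longSubject, bPos, 1);
const precededByC = boundedLookbehindAt("c", longSubject, bPos, 1);
console.log(precededByA, precededByC);
```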

The Right-to-Left Mode that powers .NET's lookbehind is pretty neat. It magically follows the plain-old-developer's intuitive expectation while working backward rather than from the start of the string. Unfortunately, how it actually works is fairly mysterious. Although it works fairly reliably, as I previously mentioned it can occasionally be a bit buggy/weird.

And you will probably never achieve that /(<re>)$/ and /(?<(re))$/
always capture the same substring :)

Apart from potential bugs, (<re>)$ and (?<=(<re>))$ capture the same string in every implementation of lookbehind that I know of.


--Steven Levithan


_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
