RFC 72 (v3) Variable-length lookbehind: the regexp engine should also go backward.

Perl6 RFC Librarian Fri, 01 Sep 2000 13:44:51 -0700
This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE

Variable-length lookbehind: the regexp engine should also go backward. 

=head1 VERSION

  Maintainer: Peter Heslin <[EMAIL PROTECTED]>
  Date: 9 Aug 2000
  Last Modified: 1 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Version: 3
  Number: 72
  Status: Developing

=head1 ABSTRACT

At present, lookbehind in regular expressions is limited to
fixed-length, zero-width assertions.  This RFC proposes one possible way
of implementing fully-fledged lookbehind.

=head1 CHANGES

Version 3 is a revision to make it clearer that the purpose of the
proposal is to implement a general scheme for variable-length
lookbehind.  The title has been changed to reflect this.

Version 2 of this RFC redirects discussion of this topic to 
[EMAIL PROTECTED]

=head1 DESCRIPTION

It is proposed that the regular expression engine should be designed so that,
when it is in an accepting state, it will either consume the character after
the end of the currently matching part of the target string, or the character
just before its beginning, depending on instructions embedded in the
regexp itself.

The purpose of this mechanism is to allow one to construct a regexp that
matches a part of a string and then goes back to the part of the string
before the match to continue there. 

=head2 Syntax

I originally proposed two new regexp extensions which mean "jump and
reverse" and "jump and go forward".  These might be (?r) and (?f) or
(?[) and (?]) respectively, or some other pair.  It is important to note
that this proposal does not suggest that the same characters in a string
should be matched twice or more in a single hit by different parts of
the same regexp going in different directions; that way madness lies.  

The "jump" part of these commands means to jump immediately to the beginning
or the end of the currently matched part of the string and continue from there.
In this way, the match continues to grow one character at a time, but at both
ends.  So /BC$(?r)A^/ matches ABC.

An alternative was suggested by Mike Mulligan, which I have adapted
somewhat to my own purposes: to use something like (?`= ... ) and 
(?`! ... ) instead.  This has the advantage of looking more like the
present lookbehind syntax that this proposal is meant to enhance.  He
argued that it will be confusing to have CBA in a regexp potentially
match ABC, and this is probably true.  

Since with this syntax /DE(?`=ABC+)/ will match ABCCCCDE, this means
that Perl has to do a little more work.  It will have to read the
pattern inside the (?`= ... ) as well as the target string from right to
left, first matching the C+ from the start of the current match before
D, and then the B, then the A.  This serves to hide the backwards nature
of the pattern matching from the programmer, which might be considered a
good thing or a bad thing depending on your perspective.  

=head2 Examples

Imagine a very long input string containing data such as this:

    ... GCAAGAATTGAACTGTAG ...

If you want to match text that matches /GA+C/, but not when it
follows /G+A+T+/, you cannot at present do so easily.  Under this
proposal, you might be able to it like this: /GA+C(?!(?r)T+A+G+)/, or
under the alternative syntax: /GA+C(?`!G+A+T+)/.

In addition to such negative assertions, there may be times when the
ability to match in both directions might be useful in order to direct
the regexp engine on how you wish it to approach the match.  You could
say /GA+C(?r)(T+A+G+)/ instead of /(G+A+T+)GA+C/, and be confident that
the fixed string GA will be the thing that the regexp engine looks for
first.  In most cases Perl is smart enough to do this on its own, but in
very complex cases it may get confused.  If the programmer knows (or
thinks she knows) the most efficient order in which to apply her
regexp, then it might be useful if she had a means of passing this
information on to Perl.  If so, it could even be legal to have
multiple  instances of (?r) and (?f) in the same regexp. A long string
containing:

        AABBCCDDEEFFGGHHIIJJKKLLMM

would be matched there by:

    m/F+G+(?r)E+D+(?f)H+I+J+K+(?r)C+B+A+(?f)L+M+/

or, alternate syntax:

    m/F+G+(?`=D+E)H+I+J+K+(?`=A+B+C+)L+M+/

This would be preferable to m/A+B+C+D+E+F+G+H+I+J+K+L+M+/ if the
programmer's knowledge of the target data led her to believe that this
was the best way to approach the match (perhaps D and E are
somewhat rare,  A, B and C are very common, and G's following F's are
known to be very rare).

=head2 Related Features

It will be important to know the offset into the target string where
the match begins, as well as where it ends.  This information would be
useful in any case, even if the rest of this proposal is not approved.
Perhaps the cleanest way to report this info would be to have C<pos>
return both starting and ending offsets in a list context, if this can
be done without a performance penalty.

=head1 IMPLEMENTATION

This is a potentially major change, and it is the sort of thing that
could only be contemplated at the beginning of a major rewrite.  There
may be unforseen problems that will make this proposal impossible to
implement, or it may result in imposing an unacceptable performance
penalty on ordinary, forward matches.  On the other hand, it may be that
if the possibility that matches may run backward as well as forward is
considered as the code is rewritten from scratch, this proposal
will prove straightforward to implement, or at least easier than
trying to tack on variable-length lookbehind as an afterthought.

Here are some potential implementation issues:

=head2 Backtracking

Presumably, the regexp engine maintains something like a stack which is
popped when a prospective match fails and it has to fall back to an earlier 
possibility.  If so, when we jump and switch direction, we will have to push
an indication of this onto the stack, so that when it is popped again we
know to jump and switch the other way when backtracking off a failed match.

=head2 Backreferences

The text captured in parentheses while (?r) is operative will be reversed
with respect to its order in the target string.  If the alternative
syntax is adopted, then this would not make sense, and so the (?`= ... )
would have to reverse these data before returning them to the user, so
that the captured text will run forward. 

=head2 Multiple matching

The C</g> modifier would behave just as it does currently: the next attempt
begins the character after the end of the previous match.  If the previous
example of C</$(?r).*CBA/> were extended so that one might write C</(?r).*CBA/>
(without the anchor at the start), then we would want in this case (i.e. when the
regexp begins with (?r)) to make an exception to normal behavior and have the
next match begin the character before the beginning of the previous one.

=head1 REFERENCES

_Mastering Regular Expressions_, Jeffrey Friedl (O'Reilly)
RFC 72 (v3) Variable-length lookbehind: the regexp engine should also go backward.

Reply via email to