This and other RFCs are available on the web at http://dev.perl.org/rfc/ =head1 TITLE Variable-length lookbehind: the regexp engine should also go backward. =head1 VERSION Maintainer: Peter Heslin <[EMAIL PROTECTED]> Date: 9 Aug 2000 Last Modified: 1 Sep 2000 Mailing List: [EMAIL PROTECTED] Version: 3 Number: 72 Status: Developing =head1 ABSTRACT At present, lookbehind in regular expressions is limited to fixed-length, zero-width assertions. This RFC proposes one possible way of implementing fully-fledged lookbehind. =head1 CHANGES Version 3 is a revision to make it clearer that the purpose of the proposal is to implement a general scheme for variable-length lookbehind. The title has been changed to reflect this. Version 2 of this RFC redirects discussion of this topic to [EMAIL PROTECTED] =head1 DESCRIPTION It is proposed that the regular expression engine should be designed so that, when it is in an accepting state, it will either consume the character after the end of the currently matching part of the target string, or the character just before its beginning, depending on instructions embedded in the regexp itself. The purpose of this mechanism is to allow one to construct a regexp that matches a part of a string and then goes back to the part of the string before the match to continue there. =head2 Syntax I originally proposed two new regexp extensions which mean "jump and reverse" and "jump and go forward". These might be (?r) and (?f) or (?[) and (?]) respectively, or some other pair. It is important to note that this proposal does not suggest that the same characters in a string should be matched twice or more in a single hit by different parts of the same regexp going in different directions; that way madness lies. The "jump" part of these commands means to jump immediately to the beginning or the end of the currently matched part of the string and continue from there. In this way, the match continues to grow one character at a time, but at both ends. So /BC$(?r)A^/ matches ABC. An alternative was suggested by Mike Mulligan, which I have adapted somewhat to my own purposes: to use something like (?`= ... ) and (?`! ... ) instead. This has the advantage of looking more like the present lookbehind syntax that this proposal is meant to enhance. He argued that it will be confusing to have CBA in a regexp potentially match ABC, and this is probably true. Since with this syntax /DE(?`=ABC+)/ will match ABCCCCDE, this means that Perl has to do a little more work. It will have to read the pattern inside the (?`= ... ) as well as the target string from right to left, first matching the C+ from the start of the current match before D, and then the B, then the A. This serves to hide the backwards nature of the pattern matching from the programmer, which might be considered a good thing or a bad thing depending on your perspective. =head2 Examples Imagine a very long input string containing data such as this: ... GCAAGAATTGAACTGTAG ... If you want to match text that matches /GA+C/, but not when it follows /G+A+T+/, you cannot at present do so easily. Under this proposal, you might be able to it like this: /GA+C(?!(?r)T+A+G+)/, or under the alternative syntax: /GA+C(?`!G+A+T+)/. In addition to such negative assertions, there may be times when the ability to match in both directions might be useful in order to direct the regexp engine on how you wish it to approach the match. You could say /GA+C(?r)(T+A+G+)/ instead of /(G+A+T+)GA+C/, and be confident that the fixed string GA will be the thing that the regexp engine looks for first. In most cases Perl is smart enough to do this on its own, but in very complex cases it may get confused. If the programmer knows (or thinks she knows) the most efficient order in which to apply her regexp, then it might be useful if she had a means of passing this information on to Perl. If so, it could even be legal to have multiple instances of (?r) and (?f) in the same regexp. A long string containing: AABBCCDDEEFFGGHHIIJJKKLLMM would be matched there by: m/F+G+(?r)E+D+(?f)H+I+J+K+(?r)C+B+A+(?f)L+M+/ or, alternate syntax: m/F+G+(?`=D+E)H+I+J+K+(?`=A+B+C+)L+M+/ This would be preferable to m/A+B+C+D+E+F+G+H+I+J+K+L+M+/ if the programmer's knowledge of the target data led her to believe that this was the best way to approach the match (perhaps D and E are somewhat rare, A, B and C are very common, and G's following F's are known to be very rare). =head2 Related Features It will be important to know the offset into the target string where the match begins, as well as where it ends. This information would be useful in any case, even if the rest of this proposal is not approved. Perhaps the cleanest way to report this info would be to have C<pos> return both starting and ending offsets in a list context, if this can be done without a performance penalty. =head1 IMPLEMENTATION This is a potentially major change, and it is the sort of thing that could only be contemplated at the beginning of a major rewrite. There may be unforseen problems that will make this proposal impossible to implement, or it may result in imposing an unacceptable performance penalty on ordinary, forward matches. On the other hand, it may be that if the possibility that matches may run backward as well as forward is considered as the code is rewritten from scratch, this proposal will prove straightforward to implement, or at least easier than trying to tack on variable-length lookbehind as an afterthought. Here are some potential implementation issues: =head2 Backtracking Presumably, the regexp engine maintains something like a stack which is popped when a prospective match fails and it has to fall back to an earlier possibility. If so, when we jump and switch direction, we will have to push an indication of this onto the stack, so that when it is popped again we know to jump and switch the other way when backtracking off a failed match. =head2 Backreferences The text captured in parentheses while (?r) is operative will be reversed with respect to its order in the target string. If the alternative syntax is adopted, then this would not make sense, and so the (?`= ... ) would have to reverse these data before returning them to the user, so that the captured text will run forward. =head2 Multiple matching The C</g> modifier would behave just as it does currently: the next attempt begins the character after the end of the previous match. If the previous example of C</$(?r).*CBA/> were extended so that one might write C</(?r).*CBA/> (without the anchor at the start), then we would want in this case (i.e. when the regexp begins with (?r)) to make an exception to normal behavior and have the next match begin the character before the beginning of the previous one. =head1 REFERENCES _Mastering Regular Expressions_, Jeffrey Friedl (O'Reilly)
