This question is inspired by a recent post in comp.lang.perl.misc. The goal is, given a collection of sub-regexps, write a regexp that will generate the sets of locations corresponding to single matches of these sub-regexps, irrespective of order, and doing as little backtracking as possible. For example, if the sub-regexps are /abc/, /pqr/, and /xyz/, and the input string is given on the third line below
#0 1 2 #01234567890123456789012345 '__pqr____xyz__pqr___abc___' then the desired regexp should somehow generate the sets { (2,4) , (9,11) , (20,22) } and { (9,11), (14,16), (20,22) } corresponding to the matches '__pqr____xyz__pqr___abc___' ^^^ ^^^ ^^^ and '__pqr____xyz__pqr___abc___' ^^^ ^^^ ^^^ For the time being, I'm assuming that the sub-regexps are "non-overlapping". One proposal that was posted, by way of "warm-up", was this: 1 use strict; 2 use re 'eval'; 3 4 my @re0 = qw(abc pqr xyz); 5 my @seen = (undef) x @re0; 6 my @re = map sprintf('%s(??{ $seen[%d] ||= "@-" })', 7 $re0[$_], $_), 8 0..$#re0; 9 my $re = eval "qr/@{[join('|', @re)]}/"; 10 11 #0 1 2 12 #01234567890123456789012345 13 '__pqr____xyz__pqr___abc___' =~ /(($re).*?)*(?!)/; 14 15 print "\$seen[$_] = $seen[$_]\n" for (0..$#seen); 16 17 __END__ $seen[0] = 20 $seen[1] = 2 $seen[2] = 9 Of course, this is incomplete, but it is a start. One thing that puzzles me is that the (?!) is absolutely required on line 13. Without it, @seen doesn't get initialized at all. Why should this be? I would have thought that the final '*' in the regexp would be enough to get the regexp to continue attempting to match. Be that as it may, the need for the final (?!) suggests to me that this regexp is doing far more backtracking than one would want. Also it suggests that no amount of tweaking it will make it find the second set of starting points ({ 9, 14, 20 }). Another thing that puzzles me about the code above is the ?? clause. I'm amazed it works at all. I thought it was supposed to evaluate to a valid regexp, whereas in this case it evaluates to an integer. Any light shed on these questions would be much appreciated. kj