This question is inspired by a recent post in comp.lang.perl.misc.
The goal is, given a collection of sub-regexps, write a regexp that
will generate the sets of locations corresponding to single matches of
these sub-regexps, irrespective of order, and doing as little
backtracking as possible.  For example, if the sub-regexps are /abc/,
/pqr/, and /xyz/, and the input string is given on the third line
below

#0         1         2
#01234567890123456789012345
'__pqr____xyz__pqr___abc___'

then the desired regexp should somehow generate the sets

 { (2,4) , (9,11) , (20,22) } and
 { (9,11), (14,16), (20,22) }

corresponding to the matches

'__pqr____xyz__pqr___abc___'
   ^^^    ^^^        ^^^
and

'__pqr____xyz__pqr___abc___'
          ^^^  ^^^   ^^^

For the time being, I'm assuming that the sub-regexps are
"non-overlapping".


One proposal that was posted, by way of "warm-up", was this:

     1  use strict;
     2  use re 'eval';
     3
     4  my @re0  = qw(abc pqr xyz);
     5  my @seen = (undef) x @re0;
     6  my @re   = map sprintf('%s(??{ $seen[%d] ||= "@-" })',
     7                         $re0[$_], $_),
     8                 0..$#re0;
     9  my $re   = eval "qr/@{[join('|', @re)]}/";
    10
    11  #0         1         2
    12  #01234567890123456789012345
    13  '__pqr____xyz__pqr___abc___' =~ /(($re).*?)*(?!)/;
    14
    15  print "\$seen[$_] = $seen[$_]\n" for (0..$#seen);
    16
    17  __END__

    $seen[0] = 20
    $seen[1] = 2
    $seen[2] = 9

Of course, this is incomplete, but it is a start.  One thing that
puzzles me is that the (?!) is absolutely required on line 13.
Without it, @seen doesn't get initialized at all.  Why should this be?
I would have thought that the final '*' in the regexp would be enough
to get the regexp to continue attempting to match.

Be that as it may, the need for the final (?!) suggests to me that
this regexp is doing far more backtracking than one would want.  Also
it suggests that no amount of tweaking it will make it find the second
set of starting points ({ 9, 14, 20 }).

Another thing that puzzles me about the code above is the ?? clause.
I'm amazed it works at all.  I thought it was supposed to evaluate to
a valid regexp, whereas in this case it evaluates to an integer.

Any light shed on these questions would be much appreciated.

kj

Reply via email to