Re: RFC 72 (v1) The regexp engine should go backward as well as forward.

Mark-Jason Dominus Wed, 30 Aug 2000 20:49:08 -0700

The big thing I find missing from this RFC is compelling examples.
You are proposing a major change to the regex engine but you only have
two examples.  Both involve only fixed strings and one of them is
artificial.  I really think you need to discuss in more detail why
this feature would be useful.

You specifically said that you wanted your feature to be able to match
expressions other than fixed strings, but you didn't give any examples
of that.

> With the proposed extension, you could write:
> 
>     m/GAAC(?r)(TTAAG| .... )/
> 
> and the regexp engine doesn't have to go looking deep into your regexp to
> know where it should start potential matches.

OK, now here it's not really clear why you would want to use your
feature instead of doing something like this:

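        while (m/GAAC/g) {
          my $start = pos($_) - 4;    # offset where 'GAAC' begins
          # 'GAATT' is 'TTAAG' as it appears in the text just before 'GAAC'
          last if $start >= 5 && substr($_, $start-5, 5) eq 'GAATT';
          last if ...;
          ...;
        }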

You could make an argument that yours is more compact, but my version
could easily be wrapped into a subroutine, and it doesn't seem like
a particularly common operation, so it doesn't seem like there needs
to be another way to say this.  Of course, I might have completely
missed the point.  More and better examples would be a great help
here.
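
To make the subroutine idea concrete, here is a minimal sketch.  The
name match_with_prefix and its argument layout are hypothetical; they
are not part of the RFC or of Perl.  It assumes the backward part can
be written as a list of fixed strings, spelled left to right as they
appear in the text.

```perl
use strict;
use warnings;

# Hypothetical helper: find the first occurrence of the literal $anchor in
# $text that is immediately preceded by one of the literal @prefixes.
# Returns the offset where the prefix begins, or -1 if there is no such place.
sub match_with_prefix {
    my ($text, $anchor, @prefixes) = @_;
    while ($text =~ /\Q$anchor\E/g) {
        my $start = pos($text) - length $anchor;   # where $anchor begins
        for my $prefix (@prefixes) {
            my $len = length $prefix;
            next if $len > $start;                 # not enough text before the anchor
            return $start - $len
                if substr($text, $start - $len, $len) eq $prefix;
        }
    }
    return -1;
}

# Made-up sample data: the second GAAC is preceded by GAATT.
my $dna = 'CCGAACTTGAATTGAACAA';
print match_with_prefix($dna, 'GAAC', 'GAATT', 'CCCCC'), "\n";
```

This only covers fixed-string prefixes, which is all the RFC's own
examples use.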

> As a frivolous illustration, the string 
> 
>       ABCDEFGHIJKLM
> 
> would be matched by:
> 
>     m/FG(?r)EDCB(?f)HIJK(?r)A^(?f)LM$/

If I understand your proposal correctly, it will not change the
behavior of the regex if you collect the (?f) and (?r) sections
together.  If this is true, then these all have the same meaning:

     m/FG(?r)EDCB(?f)HIJK(?r)A^(?f)LM$/       # Your example
     m/FGHIJK(?r)EDCB(?r)A^(?f)LM$/
     m/FGHIJK(?r)EDCBA^(?f)LM$/
     m/FGHIJKLM$(?r)EDCBA^/                   # Why not just say this?

If I am correct, then it doesn't appear that there is ever any reason
to have more than one (?r) and one (?f) in a single regex.  Also,
since there is in effect an implicit (?f) at the beginning of every
regex, you don't need a (?f) escape at all, as in the example I just
showed.  

Did I misunderstand your proposal?  Or did I miss seeing the
implication of some example that you didn't include?  If I am correct,
I think you should eliminate (?f) from your proposal, since it is not
useful.

> It will be important to know the offset where the match begins, as
> well as where it ends (indeed it would be nice to have that info in
> Perl5 without having to pay the C<length $&> performance penalty),
> so in addition to C<pos>, there might be a function C<prepos> to
> give the start of the match -- or C<pos> might return both end and
> start offsets in a list context.

OK, that's very nice, but you say you don't want the $& penalty.
I suspect from your discussion that you don't really understand that
$& penalty.  There are two parts to the $& penalty.

The first part is that maintaining the information for $& has a cost.
Maintaining this information for your prepos() function is going to
incur an identical cost.

The other part of the $& penalty is that, because $& itself is a global
variable, the penalty has to be paid by every regex in the program.
This is not a problem with the information in $&; it is a problem with
the interface to the information.  If the interface were different, $&
would not be a problem.  For example, if $& were only set on regexes
with a /k modifier, as proposed in RFC158, a lot of the pain of $&
would go away.

Now if something like RFC158 were adopted, then your rationale for
prepos() would go away, because length($&) would no longer be
particularly expensive.  At least, there would be no reason to suppose
it would be more expensive than your proposal.

However, a prepos() function would have exactly the same problem as $&
presently has.  Whenever Perl did a regex match on any regex in the
entire program, it would have no way of knowing whether prepos() might
be called much later, so the cost of computing and storing the
prepos() information would be incurred.    

Rather than evading the $& problem, as you suggest, introducing
prepos() is going to make it even worse.

You can evade this problem by making prepos() lexically scoped.  For
example, prepos() information could be computed only for regexes that
have the /q modifier on the end, or could be available only inside the
scope of a 'use prepos' declaration.  Either of these would fix this
problem.

> I have no idea whether this feature will help people parsing right-to-left
> languages; it seems likely to help with bi-directional texts (see RFC 50).

I was wondering that myself, but I don't think it will, because RTL
text is not encoded backwards in the string itself.  It only *prints*
right-to-left.  But I may be mistaken, and I think you should consult
with Roman Parparov on this point before submitting the next revision
of this RFC.

Finally, some general comments: First, it seems to me that if there
were simply a better interface to pos() and to length($&), the need
for this feature would go away.  Let's suppose that there *was* a
list-context pos() function like the one you propose.

Then you can get the effect of /FOO(?r)BAR/ this way:

        while ($s =~ /FOO/g) {
          my ($start, $end) = pos($s);
          if (substr($s, 0, $start) =~ /RAB$/) {
            # /FOO(?r)BAR/ would have been true
          }
        }

Note that this works for arbitrary patterns FOO and BAR, not just for
fixed strings.
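
For what it is worth, a list-context pos() is not strictly needed to
try this out: Perl 5.6 provides the arrays @- and @+, whose first
elements hold the start and end offsets of the last successful match.
The sketch below uses them.  The subroutine name, the sample string,
and the patterns FOO and RAB are placeholders only.

```perl
use strict;
use warnings;

# Return every offset at which FOO matches and the text just before it
# ends with RAB, that is, where /FOO(?r)BAR/ would have been true.
sub find_foo_after_rab {
    my ($s) = @_;
    my @hits;
    while ($s =~ /FOO/g) {
        # Save the offsets now; the inner match below overwrites @- and @+.
        my ($start, $end) = ($-[0], $+[0]);
        if (substr($s, 0, $start) =~ /RAB$/) {
            push @hits, $start;
        }
    }
    return @hits;
}

my @hits = find_foo_after_rab('xxFOO..RABFOO');
print "@hits\n";
```

Note that substr() copies the prefix on every iteration, so this is a
demonstration of the idea rather than something tuned for long
strings.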

Second, you make some argument (which I didn't quite follow) about
optimization and the speed of the regex engine.  This is very shaky
ground.  That is like saying that when your Grandma moves house,
you are going to transport her belongings in a race car, to take
advantage of its speed.  But by the time you finish loading Grandma's
paraphernalia onto the race car, it is no longer fast.  The regex
engine is very highly optimized, and at best (?r) and (?f) are likely
to defeat many of the optimizations that make the regex engine fast to
begin with.  (/i does this also.)  And at worst, making the regex
engine general enough to support this feature might make it much
slower even for patterns that don't use the feature.  The inner loop
of the regex engine has a pointer to the current character position in the
string.  The inner loop is full of s++ expressions that advance the
pointer one character at a time.  If the engine has to match backwards
also, these s++'es are all going to have to change to s+=d's or (d ?
s++ : s--)'es or some such.  
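
To illustrate the kind of change I mean, here is a toy literal matcher
in Perl.  It is emphatically not the real engine's code, and the name
match_literal and the direction argument $d are made up for the
example.  The point is only that every step that used to be a plain
increment now has to consult a direction.

```perl
use strict;
use warnings;

# Toy sketch: match the characters of $lit against $str starting at offset
# $s, stepping by $d (+1 for forward, -1 for backward).  Returns the offset
# reached after the last step, or nothing if the match fails.
sub match_literal {
    my ($str, $s, $lit, $d) = @_;
    for my $ch (split //, $lit) {
        return if $s < 0 || $s >= length $str;   # ran off either end
        return if substr($str, $s, 1) ne $ch;    # character mismatch
        $s += $d;    # would be a plain $s++ in a forward-only matcher
    }
    return $s;
}

my $str = 'ABCDEFGHIJKLM';
my $fwd = match_literal($str, 5, 'FG',   +1);    # F is at offset 5
my $bwd = match_literal($str, 4, 'EDCB', -1);    # E is at offset 4
print "$fwd $bwd\n";
```

Here the extra cost is trivial, but in a tight C loop the added
arithmetic or branch is paid on every character, whether or not the
pattern ever goes backward.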

I am not saying that it could not be done, but if I were you I would
be very reluctant to make an argument from performance without talking
to someone with some expertise first.  Perhaps Hugo van der Sanden
would be willing to discuss this with you in more detail?

I hope you find these remarks helpful.