Re: Another regexp performance improvement: skip useless paren-captures

Mark Dilger Mon, 09 Aug 2021 17:14:48 -0700

> On Aug 9, 2021, at 4:31 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> 
> There is a potentially interesting definitional question:
> what exactly ought this regexp do?
> 
>               ((.)){0}\2
> 
> Because the capturing paren sets are zero-quantified, they will
> never be matched to any characters, so the backref can never
> have any defined referent.

Perl regular expressions are not POSIX, but if there is a principled reason 
POSIX should differ from perl on this, we should be clear what that is:

    #!/usr/bin/perl

    use strict;
    use warnings;

    our $match;
    if ('foo' =~ m/((.)(??{ die; })){0}(..)/)
    {
        print "captured 1 $1\n" if defined $1;
        print "captured 2 $2\n" if defined $2;
        print "captured 3 $3\n" if defined $3;
        print "captured 4 $4\n" if defined $4;
        print "match = $match\n" if defined $match;
    }

This will print "captured 3 fo", proving that although the regular expression 
is parsed with the (..) bound to the third capture group, the first two capture 
groups never run.  If you don't believe that, change the {0} to {1} and observe 
that the script dies.

> So I think throwing an
> error is an appropriate response.  The existing code will
> throw such an error for
> 
>               ((.)){0}\1
> 
> so I guess Spencer did think about this to some extent -- he
> just forgot about the possibility of nested parens.

Ugh.  That means our code throws an error where perl does not, pretty well 
negating my point above.  If we're already throwing an error for this type of 
thing, I agree we should be consistent about it.  My personal preference would 
have been to do the same thing as perl, but it seems that ship has already 
sailed.
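
To check what perl itself does with the two backref patterns quoted above, a 
minimal sketch along the following lines could be used.  The classify() helper 
is hypothetical, not something from this thread; it only reports whether perl 
rejects the pattern at compile time, matches, or fails to match, and makes no 
claim about which of those is the right answer.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helper (an assumption for illustration, not from the thread):
# report how perl treats a pattern as 'error', 'match', or 'no match'.
sub classify
{
    my ($subject, $pattern) = @_;

    # Compile inside eval so that a compile-time error is reported
    # rather than killing the script.
    my $re = eval { qr/$pattern/ };
    return 'error' unless defined $re;

    return $subject =~ $re ? 'match' : 'no match';
}

# Zero-quantified groups followed by an ordinary capture, as in the
# script above but without the (??{ die; }) probe.
print '((.)){0}(..) : ', classify('foo', '((.)){0}(..)'), "\n";

# The two backref patterns from the quoted message.
print '((.)){0}\1   : ', classify('foo', '((.)){0}\1'), "\n";
print '((.)){0}\2   : ', classify('foo', '((.)){0}\2'), "\n";
```

The patterns are passed as single-quoted strings so that \1 and \2 reach the 
regex compiler unchanged.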

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
