re_tests - extend test to show more buffers

Yves Orton via perl5-changes Sat, 14 Jan 2023 03:49:39 -0800

  Branch: refs/heads/yves/curlyx_curlym
  Home:   https://github.com/Perl/perl5
  Commit: d2742857dc280b101aee0d4264e692aba145b772
      
https://github.com/Perl/perl5/commit/d2742857dc280b101aee0d4264e692aba145b772
  Author: Yves Orton <[email protected]>
  Date:   2023-01-14 (Sat, 14 Jan 2023)


  Changed paths:
    M t/re/re_tests

  Log Message:
  -----------
  t/re/re_tests - extend test to show more buffers

This is a tricky test; showing more buffers makes it a bit easier
to understand if you break it. (Guess what I did?)


  Commit: 3511c1099c8a1934e59cfc23d593815beb60ce4e
      
https://github.com/Perl/perl5/commit/3511c1099c8a1934e59cfc23d593815beb60ce4e
  Author: Yves Orton <[email protected]>
  Date:   2023-01-14 (Sat, 14 Jan 2023)

  Changed paths:
    M regcomp.c
    M regcomp.h
    M regcomp_internal.h
    M t/re/pat.t
    M t/re/reg_mesg.t

  Log Message:
  -----------
  regcomp.c - increase size of CURLY nodes so the min/max is an I32

This allows us to resolve a test inconsistency between CURLYX and CURLY
and CURLYM. We use I32 because the existing count logic uses -1 and
this keeps everything unsigned compatible.


  Commit: ad814caeaef5c0cbcceb375f82e7bddeab3a0069
      
https://github.com/Perl/perl5/commit/ad814caeaef5c0cbcceb375f82e7bddeab3a0069
  Author: Yves Orton <[email protected]>
  Date:   2023-01-14 (Sat, 14 Jan 2023)

  Changed paths:
    M regcomp_internal.h
    M regcomp_study.c

  Log Message:
  -----------
  regcomp_study.c - Add a way to disable CURLYX optimisations

Also break up the condition so there is one condition per line so
it is more readable, and fold repeated binary tests together. This
makes it more obvious what the expression is doing.


  Commit: bb40ff57a775338b6c6768c31b5924cee92fd610
      
https://github.com/Perl/perl5/commit/bb40ff57a775338b6c6768c31b5924cee92fd610
  Author: Yves Orton <[email protected]>
  Date:   2023-01-14 (Sat, 14 Jan 2023)

  Changed paths:
    M regcomp_debug.c
    M regcomp_study.c
    M t/re/pat_re_eval.t

  Log Message:
  -----------
  regcomp_study.c - disable CURLYX optimizations when EVAL has been seen anywhere

Historically we disabled CURLYX optimizations when they
*contained* an EVAL, on the assumption that the optimization might
affect how many times, etc, the eval was called. However, this is
also true for CURLYX with evals *afterwards*. If the CURLYN or CURLYM
optimization can prune off the search space, then an eval afterwards
will be affected. And when you take into account GOSUB, it means that
an eval in front might be affected by an optimization after it.

So for now we disable CURLYN and CURLYM in any pattern with an EVAL.


  Commit: 9fd814f5381aacbef69081e862e6c7978bc60caa
      
https://github.com/Perl/perl5/commit/9fd814f5381aacbef69081e862e6c7978bc60caa
  Author: Yves Orton <[email protected]>
  Date:   2023-01-14 (Sat, 14 Jan 2023)

  Changed paths:
    M regexec.c

  Log Message:
  -----------
  regexec.c - rework CLOSE_CAPTURE() macro to take a rex argument

This allows it to be used in contexts where rex isn't set up under
this name.


  Commit: 217576d3ee9f2c10c4f0801466cef1f7607ff05f
      
https://github.com/Perl/perl5/commit/217576d3ee9f2c10c4f0801466cef1f7607ff05f
  Author: Yves Orton <[email protected]>
  Date:   2023-01-14 (Sat, 14 Jan 2023)

  Changed paths:
    M regcomp.c
    M regcomp.h

  Log Message:
  -----------
  regcomp.h - get rid of EXTRA_STEP defines

They are unused these days.


  Commit: 7c888ec0e71d0745805eeffe10129e00d48017e1
      
https://github.com/Perl/perl5/commit/7c888ec0e71d0745805eeffe10129e00d48017e1
  Author: Yves Orton <[email protected]>
  Date:   2023-01-14 (Sat, 14 Jan 2023)

  Changed paths:
    M regcomp.c

  Log Message:
  -----------
  regcomp.c - add whitespace to binary operation

The tight & is hard to read.


  Commit: 89af12057d28b19cc98ffd7c48ce1fc0391af837
      
https://github.com/Perl/perl5/commit/89af12057d28b19cc98ffd7c48ce1fc0391af837
  Author: Yves Orton <[email protected]>
  Date:   2023-01-14 (Sat, 14 Jan 2023)

  Changed paths:
    M regcomp_trie.c

  Log Message:
  -----------
  regcomp_trie.c - use the indirect types so we are safe against changes

We shouldn't assume that a TRIEC is a regcomp_charclass. We have a
per-opcode type exactly for this type of use, so let's use it.


  Commit: da235079408ac8320da15d22210c363441e4a3ef
      
https://github.com/Perl/perl5/commit/da235079408ac8320da15d22210c363441e4a3ef
  Author: Yves Orton <[email protected]>
  Date:   2023-01-14 (Sat, 14 Jan 2023)

  Changed paths:
    M pod/perldebguts.pod
    M pp_ctl.c
    M regcomp.c
    M regcomp.h
    M regcomp.sym
    M regcomp_debug.c
    M regexec.c
    M regexp.h
    M regnodes.h
    M t/re/pat.t
    M t/re/pat_rt_report.t
    M t/re/re_tests

  Log Message:
  -----------
  regcomp.c - Resolve issues clearing buffers in CURLYX (MAJOR-CHANGE)

CURLYX doesn't reset capture buffers properly. It is possible
for multiple buffers to be defined at once with values from
different iterations of the loop, which doesn't make sense really.

An example is this:

  "foobarfoo"=~/((foo)|(bar))+/

after this matches $1 should equal $2 and $3 should be undefined,
or $1 should equal $3 and $2 should be undefined. Prior to this
patch this would not be the case.

The solution that this patch uses is to introduce a form of
"layered transactional storage" for paren data. The existing
pair of start/end data for capture data is extended with a
start_new/end_new pair. When the vast majority of our code wants
to check if a given capture buffer is defined, it first checks
start_new/end_new; if either is -1 then it falls back to
whatever is in start/end.

When a capture buffer is CLOSEd the data is written into the
start_new/end_new pair instead of the start/end pair. When a CURLYX
loop is executing and has matched something (at least one "A" in
/A*B/ -- thus actually in WHILEM) it "commits" the start_new/end_new
data by writing it into start/end. When we begin a new iteration of
the loop we clear the start_new/end_new pairs that are contained by
the loop, by setting them to -1. If the loop fails then we roll back
as we used to. If the loop succeeds we continue. When we hit an END
block we commit everything.

Consider the example above. We start off with everything set to -1.

 $1 = (-1,-1):(-1,-1)
 $2 = (-1,-1):(-1,-1)
 $3 = (-1,-1):(-1,-1)

In the first iteration we have matched "foo" and end up with this:

 $1 = (-1,-1):( 0, 3)
 $2 = (-1,-1):( 0, 3)
 $3 = (-1,-1):(-1,-1)

We commit the results of $2 and $3, and then clear the new data at
the beginning of the next iteration:

 $1 = (-1,-1):( 0, 3)
 $2 = ( 0, 3):(-1,-1)
 $3 = (-1,-1):(-1,-1)

We then match "bar", which occupies offsets 3 to 6:

 $1 = (-1,-1):( 0, 3)
 $2 = ( 0, 3):(-1,-1)
 $3 = (-1,-1):( 3, 6)

and then commit the result and clear the new data:

 $1 = (-1,-1):( 0, 3)
 $2 = (-1,-1):(-1,-1)
 $3 = ( 3, 6):(-1,-1)

and then we match "foo" again, at offsets 6 to 9:

 $1 = (-1,-1):( 0, 3)
 $2 = (-1,-1):( 6, 9)
 $3 = ( 3, 6):(-1,-1)

And we then commit. We do a regcppush here as normal.

 $1 = (-1,-1):( 0, 3)
 $2 = ( 6, 9):( 6, 9)
 $3 = (-1,-1):(-1,-1)

We then clear it again, but since the next iteration does not match,
the regcppop restores the buffers to the layout above. When we finally
hit the END block we also commit all buffers, including
the 0th (for the full match).
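
The walkthrough above can be replayed with a small toy model. This is only
a sketch of the idea, not the code in regexec.c: the class ParenStore, its
method names, and the decision to track just $2 and $3 (the two buffers the
text says are committed each iteration) are assumptions made for
illustration. Offsets are computed from the subject string.

```python
UNSET = -1

class ParenStore:
    """Toy model: committed start/end plus pending start_new/end_new."""

    def __init__(self, nparens):
        size = nparens + 1  # slot 0 is unused in this sketch
        self.start = [UNSET] * size
        self.end = [UNSET] * size
        self.start_new = [UNSET] * size
        self.end_new = [UNSET] * size

    def close(self, n, start, end):
        # A CLOSE writes into the pending pair, not the committed pair.
        self.start_new[n] = start
        self.end_new[n] = end

    def lookup(self, n):
        # Readers prefer the pending pair and fall back if either is -1.
        if self.start_new[n] == UNSET or self.end_new[n] == UNSET:
            return (self.start[n], self.end[n])
        return (self.start_new[n], self.end_new[n])

    def commit(self, parens):
        # Copy pending data over committed data, including "unset",
        # so values from an earlier iteration do not survive.
        for n in parens:
            self.start[n] = self.start_new[n]
            self.end[n] = self.end_new[n]

    def clear_new(self, parens):
        for n in parens:
            self.start_new[n] = UNSET
            self.end_new[n] = UNSET


def replay(subject="foobarfoo"):
    """Replay /((foo)|(bar))+/ iteration by iteration for $2 and $3."""
    store = ParenStore(3)
    loop_parens = (2, 3)
    pos = 0
    while True:
        store.clear_new(loop_parens)      # start of an iteration
        if subject.startswith("foo", pos):
            store.close(2, pos, pos + 3)
        elif subject.startswith("bar", pos):
            store.close(3, pos, pos + 3)
        else:
            break                         # iteration failed, loop ends
        pos += 3
        store.commit(loop_parens)         # iteration matched something
    return store
```

After replay() the lookup for $2 gives the last "foo" and the lookup for $3
reports unset, which is the consistent outcome the commit message asks for.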

Fixes GH Issue #18865, and adds tests for it and other things.


  Commit: fa4cbd0a1fdea9effbd022167ca2cf7843fe8b91
      
https://github.com/Perl/perl5/commit/fa4cbd0a1fdea9effbd022167ca2cf7843fe8b91
  Author: Yves Orton <[email protected]>
  Date:   2023-01-14 (Sat, 14 Jan 2023)

  Changed paths:
    M MANIFEST
    M t/re/regexp.t
    A t/re/regexp_normal.t

  Log Message:
  -----------
  t/re/regexp_normal.t - test "normalized" forms of patterns

This looks for discrepancies between different ways of writing
a pattern.


  Commit: 9b97bdf90c334824357407c6f01ff9917879e1e5
      
https://github.com/Perl/perl5/commit/9b97bdf90c334824357407c6f01ff9917879e1e5
  Author: Yves Orton <[email protected]>
  Date:   2023-01-14 (Sat, 14 Jan 2023)

  Changed paths:
    M pod/perldebguts.pod
    M regcomp.c
    M regcomp.h
    M regcomp.sym
    M regcomp_debug.c
    M regcomp_trie.c
    M regexec.c
    M regexp.h
    M regnodes.h
    M t/re/re_tests

  Log Message:
  -----------
  regexec.c - teach BRANCH and BRANCHJ nodes to reset capture buffers

In /((a)(b)|(a))+/ we should not end up with $2 and $4 being set at
the same time. When a branch fails it should reset any capture buffers
that might be touched by its branch.

We change BRANCH and BRANCHJ to store the number of parens before the
branch, and the number of parens after the branch was completed. When
a BRANCH operation fails, we clear the buffers it contains before we
continue on.
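
As a rough illustration of the clearing rule, here is a toy model for
/((a)(b)|(a))+/. It is a sketch under assumptions, not the engine's code:
the (parens_before, parens_after) tuples, the function names, and the
snapshot used to undo a wholly failed iteration are all made up for the
example. Each branch records which buffers it can touch, and a failed
branch clears exactly that range.

```python
def match_alternation_loop(subject):
    """Toy matcher for /((a)(b)|(a))+/; returns capture offsets 1..4."""
    caps = [None] * 5  # caps[n] is (start, end) or None; slot 0 unused

    def branch_ab(pos):
        # First alternative: (a)(b), touching buffers 2 and 3.
        if subject.startswith("a", pos):
            caps[2] = (pos, pos + 1)
            if subject.startswith("b", pos + 1):
                caps[3] = (pos + 1, pos + 2)
                return pos + 2
        return None

    def branch_a(pos):
        # Second alternative: (a), touching buffer 4.
        if subject.startswith("a", pos):
            caps[4] = (pos, pos + 1)
            return pos + 1
        return None

    # (parens before the branch, parens after the branch, matcher)
    branches = [(1, 3, branch_ab), (3, 4, branch_a)]

    pos = 0
    while True:
        saved = list(caps)  # stands in for the usual push/pop rollback
        start = pos
        for before, after, branch in branches:
            end = branch(pos)
            if end is not None:
                break
            # The branch failed: clear the buffers it contains.
            for n in range(before + 1, after + 1):
                caps[n] = None
        else:
            caps[:] = saved  # no branch matched, undo this iteration
            break
        caps[1] = (start, end)
        pos = end
    return caps
```

For the subject "aba" the second iteration starts the first alternative,
sets $2, fails on "b", and clears $2 and $3 before the second alternative
sets $4, so $2 and $4 are never set together.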

It is a bit more complex than it should be because we have BRANCHJ
and BRANCH. (One of these days we should merge them together.)

This is also made somewhat more complex because TRIE nodes are actually
branches, and may need to track capture buffers at two levels: the
overall TRIE op, and, for jump tries especially, the places where we
emulate the behavior of branches. So we have to do the same clearing
logic if a trie branch fails as well.


  Commit: 300e9a1c96bc39133f2f3f3d3fc5762d66a56087
      
https://github.com/Perl/perl5/commit/300e9a1c96bc39133f2f3f3d3fc5762d66a56087
  Author: Yves Orton <[email protected]>
  Date:   2023-01-14 (Sat, 14 Jan 2023)

  Changed paths:
    M pod/perldelta.pod
    M pod/perlre.pod
    M regcomp.c
    M regcomp.h
    M regcomp_debug.c
    M regcomp_internal.h
    M regcomp_study.c
    M regexec.c
    M regnodes.h
    M t/re/pat_re_eval.t
    M t/re/pat_rt_report.t
    M toke.c

  Log Message:
  -----------
  regcomp.c - add optimistic eval

This adds (*{ ... }) and (**{ ... }) as equivalents to
(?{ ... }) and (??{ ... }). The only difference is that
the star variants are "optimistic" and are defined to never
disable optimisations.  This is especially relevant now that
use of (?{ ... }) prevents important optimisations anywhere
in the pattern, instead of the older and inconsistent rules
where it only affected the parts that contained the EVAL.

It is also very useful for injecting debugging style expressions
into the pattern to understand what the regex engine is actually
doing. The older style (?{ ... }) variants would change the
regex engine's behavior, meaning this was not as effective a
tool as it could have been.


  Commit: 5a00917093220023413489ee7d5d131ea5112e8a
      
https://github.com/Perl/perl5/commit/5a00917093220023413489ee7d5d131ea5112e8a
  Author: Yves Orton <[email protected]>
  Date:   2023-01-14 (Sat, 14 Jan 2023)

  Changed paths:
    M regexec.c
    M t/re/pat_re_eval.t
    M t/re/regexp.t

  Log Message:
  -----------
  regexec.c - fix accept in CURLYX/WHILEM construct.

The ACCEPT logic didn't know how to handle WHILEM, which for
some reason does not have a next_off defined. I am not sure why.

This was revealed by forcing CURLYX optimisations off. This includes
a patch to test what happens if we embed an eval group in the tests
run by regexp.t when run via regexp_normal.t, which disabled CURLYX ->
CURLYN and CURLYM optimisations and revealed this issue.


  Commit: 7c3eaac18ef26c5ebcdb2222195fd5d33b1c37be
      
https://github.com/Perl/perl5/commit/7c3eaac18ef26c5ebcdb2222195fd5d33b1c37be
  Author: Yves Orton <[email protected]>
  Date:   2023-01-14 (Sat, 14 Jan 2023)

  Changed paths:
    M pod/perldelta.pod

  Log Message:
  -----------
  perldelta - add note about regex engine changes

Capture buffer semantics should now be consistent.


  Commit: d9ea8c8f3cedb396c138432d0568c3465b073dab
      
https://github.com/Perl/perl5/commit/d9ea8c8f3cedb396c138432d0568c3465b073dab
  Author: Yves Orton <[email protected]>
  Date:   2023-01-14 (Sat, 14 Jan 2023)

  Changed paths:
    M regexec.c

  Log Message:
  -----------
  regexec.c - fix memory leak in EVAL.

EVAL was calling regcppush twice per invocation, once before
executing the callback and once after, but was not calling
regcppop twice. So each time we would accumulate an extra "frame" of
data. This is/was hidden somewhat by the way we eventually
"blow" the stack, so the extra data was just thrown away at
the end.

This removes the second set of pushes so that the save stack
stays a stable size as it unwinds from each failed eval.
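
The imbalance is easy to see in a toy model. This is only an illustration
with a plain Python list standing in for the save stack; the function name
and frame contents are made up and nothing here is the real regcppush or
regcppop.

```python
def frames_left_after(invocations, pushes_per_eval):
    """Count frames left behind when each failed eval pops only once."""
    save_stack = []
    for i in range(invocations):
        # Old behaviour: two pushes per eval. Fixed behaviour: one.
        for _ in range(pushes_per_eval):
            save_stack.append(("paren-frame", i))
        # The eval fails and unwinds a single frame.
        save_stack.pop()
    return len(save_stack)
```

With two pushes per invocation the stack grows by one frame per failed
eval; with one push it returns to its starting size every time.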


  Commit: e76fc1d64fe9bd4f88374fd0d9194ef52c63f606
      
https://github.com/Perl/perl5/commit/e76fc1d64fe9bd4f88374fd0d9194ef52c63f606
  Author: Yves Orton <[email protected]>
  Date:   2023-01-14 (Sat, 14 Jan 2023)

  Changed paths:
    M regexec.c
    M regexp.h
    M t/re/re_tests

  Log Message:
  -----------
  regexec.c - incredibly inefficient solution to backref problem

Backrefs to unclosed parens inside of a quantified group were not being
properly handled, which revealed we are not unrolling the paren state properly
on failure and backtracking.

Much of the code assumes that when we execute a "conditional" operation (where
more than one thing could match) we need not concern ourselves with the
paren state unless the conditional operation itself represents a paren, and
that generally opcodes only need to concern themselves with parens to their
right. When you exclude backrefs from the equation this is broadly reasonable
(I think), as on failure we typically don't care about the state of the paren
buffers. They either get reset as we find a new different accepting pathway,
or their state is irrelevant if the overall match is rejected (e.g. it fails).

However backreferences are different. Consider the following pattern
from the tests

    "xa=xaaa" =~ /^(xa|=?\1a){2}\z/

In the first iteration through this the first branch matches, and in fact
because the \1 is in the second branch it can't match on the first iteration
at all. After this $1 = "xa". We then perform the second iteration. "xa" does
not match "=xaaa" so we fall to the second branch. The '=?' matches, but sets
up a backtracking action to not match if the rest of the pattern does not
match. \1 matches 'xa', and then the 'a' matches, leaving an unmatched 'a' in
the string. We exit the quantifier loop with $1 = "=xaa" and match \z against
the remaining "a" in the string, and fail.

Here is where things go wrong in the old code: we unwind to the outer loop,
but we do not unwind the paren state. We then unwind further into the 2nd
iteration of the loop, to the '=?' where we then try to match the tail with
the quantifier matching the empty string. We then match the old $1 (which was
not unwound) as "=xaa", and then the "a" matches, and we are at the end of the
string and we have incorrectly accepted this string as matching the pattern.

What should have happened is that when the \1 was resolved the second time it
should have returned the same string as it did when the =? matched '=', which
then would have resulted in the tail matching again, and so on, eventually
unwinding the entire pattern when the second iteration failed entirely.

This patch is very crude. It simply pushes the state of the parens and creates
an unwind point for every case where we do a transition to a B or _next
operation, and we make the corresponding _next_fail do the appropriate
unwinding. The objective was to achieve correctness and then work towards
making it more efficient. We almost certainly overstore items on the stack.

In a future patch we can perhaps keep track of the unclosed parens before the
relevant operators and make sure that they are properly pushed and unwound at
the correct times.
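
The failure described above can be reproduced with a small hand-written
backtracking matcher for /^(xa|=?\1a){2}\z/. This is a sketch under
assumptions, not the logic in regexec.c: the function name and the
restore_parens flag are hypothetical, and the matcher is hard-coded for
this one pattern. With restore_parens off, $1 keeps the value set by a
failed path; with it on, the value is put back when that path fails.

```python
def matches(subject, restore_parens):
    """Toy matcher for /^(xa|=?\\1a){2}\\z/ with optional paren unwinding."""
    cap = {1: None}  # $1 as (start, end) offsets, or None if unset

    def iterate(pos, done):
        if done == 2:
            return pos == len(subject)  # \z after two iterations
        group_start = pos

        def close_group(end):
            saved = cap[1]
            cap[1] = (group_start, end)
            if iterate(end, done + 1):
                return True
            if restore_parens:
                cap[1] = saved  # unwind $1 when the continuation fails
            return False

        # First branch: the literal "xa".
        if subject.startswith("xa", pos) and close_group(pos + 2):
            return True

        # Second branch: =?\1a, with the greedy =? tried first.
        starts = [pos + 1, pos] if subject.startswith("=", pos) else [pos]
        for p in starts:
            if cap[1] is None:
                continue  # a backref to an unset buffer cannot match
            ref = subject[cap[1][0]:cap[1][1]]
            if subject.startswith(ref + "a", p) and close_group(p + len(ref) + 1):
                return True
        return False

    return iterate(0, 0)
```

Calling matches("xa=xaaa", restore_parens=False) wrongly reports a match,
because the retry with an empty =? sees the stale "=xaa" in $1; with
restore_parens=True the same call correctly reports no match.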


  Commit: bf950474aacfa9dde95f0d3652b3b813713c7b35
      
https://github.com/Perl/perl5/commit/bf950474aacfa9dde95f0d3652b3b813713c7b35
  Author: Yves Orton <[email protected]>
  Date:   2023-01-14 (Sat, 14 Jan 2023)

  Changed paths:
    M t/re/pat_re_eval.t

  Log Message:
  -----------
  t/re/pat_re_eval.t - add note to test

This test will fail if CURLYX optimisations are disabled, and currently
we do not have a good way to detect if they have been forced off. So add
a note that the test is known to fail when optimisations are disabled, so
people do not look for bugs. This test is testing that an optimisation
does happen, so disabling the optimisation will necessarily make it fail.


Compare: https://github.com/Perl/perl5/compare/7fc54939fb81...bf950474aacf
