Branch: refs/heads/yves/curlyx_curlym
Home: https://github.com/Perl/perl5
Commit: 88434cc3bc39d9907d982d99e8256f3e041c79ef
https://github.com/Perl/perl5/commit/88434cc3bc39d9907d982d99e8256f3e041c79ef
Author: Yves Orton <[email protected]>
Date: 2023-01-15 (Sun, 15 Jan 2023)
Changed paths:
M t/re/re_tests
Log Message:
-----------
t/re/re_rests - extend test to show more buffers
This is a tricky test, showing more buffers makes it a bit easier
to understand if you break it. (Guess what I did?)
Commit: b06fbb4fdf8e0ba55286dafb629937b24ec9adb5
https://github.com/Perl/perl5/commit/b06fbb4fdf8e0ba55286dafb629937b24ec9adb5
Author: Yves Orton <[email protected]>
Date: 2023-01-15 (Sun, 15 Jan 2023)
Changed paths:
M regcomp_internal.h
M regcomp_study.c
Log Message:
-----------
regcomp_study.c - Add a way to disable CURLYX optimisations
Also break up the condition so there is one condition per line so
it is more readable, and fold repeated binary tests together. This
makes it more obvious what the expression is doing.
Commit: 59022ac481b0b72b7c14f1fcfd0e3090bcbdab95
https://github.com/Perl/perl5/commit/59022ac481b0b72b7c14f1fcfd0e3090bcbdab95
Author: Yves Orton <[email protected]>
Date: 2023-01-15 (Sun, 15 Jan 2023)
Changed paths:
M regexec.c
Log Message:
-----------
regexec.c - rework CLOSE_CAPTURE() to take rex as an arg to enable reuse.
This also splits up CLOSE_CAPTURE() into two parts, with the important parts
implemented by CLOSE_ANY_CAPTURE(), and the debugging parts in
CLOSE_CAPTURE(). This allows it to be used in contexts where the regexp
structure isn't set up under the name 'rex', and where the debugging output it
includes might not be relevant or possible to produce.
This encapsulates all the places that "close" a capture buffer, and ensures
that they are closed properly. One important case in particular cannot use
CLOSE_CAPTURE() directly, as it does not have a 'rex' variable in scope (it is
called prog in this function), nor the debugging context used in normal
execution of CLOSE_CAPTURE(). Using CLOSE_ANY_CAPTURE() instead means all the
main points that update capture buffer state use the same macro API.
Commit: 84e90860ab0be2e5997a91225a358f946210bc7f
https://github.com/Perl/perl5/commit/84e90860ab0be2e5997a91225a358f946210bc7f
Author: Yves Orton <[email protected]>
Date: 2023-01-15 (Sun, 15 Jan 2023)
Changed paths:
M regcomp.c
M regcomp.h
Log Message:
-----------
regcomp.h - get rid of EXTRA_STEP defines
They are unused these days.
Commit: d3fa340271178edacc139aa4169a556e9d1a9ec5
https://github.com/Perl/perl5/commit/d3fa340271178edacc139aa4169a556e9d1a9ec5
Author: Yves Orton <[email protected]>
Date: 2023-01-15 (Sun, 15 Jan 2023)
Changed paths:
M regcomp.c
Log Message:
-----------
regcomp.c - add whitespace to binary operation
The tight & is hard to read.
Commit: 784fccd8e85cca906c20b5e97b5088d2f667ff92
https://github.com/Perl/perl5/commit/784fccd8e85cca906c20b5e97b5088d2f667ff92
Author: Yves Orton <[email protected]>
Date: 2023-01-15 (Sun, 15 Jan 2023)
Changed paths:
M regcomp_trie.c
Log Message:
-----------
regcomp_trie.c - use the indirect types so we are safe to changes
We shouldnt assume that a TRIEC is a regcomp_charclass. We have a per
opcode type exactly for this type of use, so lets use it.
Commit: ae66e45b465d5f1a186a0f449bdb8597bd754f84
https://github.com/Perl/perl5/commit/ae66e45b465d5f1a186a0f449bdb8597bd754f84
Author: Yves Orton <[email protected]>
Date: 2023-01-15 (Sun, 15 Jan 2023)
Changed paths:
M regcomp_debug.c
M regcomp_study.c
M t/re/pat_re_eval.t
Log Message:
-----------
regcomp_study.c - disable CURLYX optimizations when EVAL has been seen
anywhere
Historically we disabled CURLYX optimizations when they
*contained* an EVAL, on the assumption that the optimization might
affect how many times, etc, the eval was called. However, this is
also true for CURLYX with evals *afterwards*. If the CURLYN or CURLYM
optimization can prune off the search space, then an eval afterwards
will be affected. An when you take into account GOSUB, it means that
an eval in front might be affected by an optimization after it.
So for now we disable CURLYN and CURLYM in any pattern with an EVAL.
Commit: 8bd6445f12d47068a263b576fc04146a9a2886b4
https://github.com/Perl/perl5/commit/8bd6445f12d47068a263b576fc04146a9a2886b4
Author: Yves Orton <[email protected]>
Date: 2023-01-15 (Sun, 15 Jan 2023)
Changed paths:
M MANIFEST
M ext/XS-APItest/APItest.xs
A ext/XS-APItest/t/savestack.t
M regexec.c
Log Message:
-----------
regexec.c - fix memory leak in EVAL.
EVAL was calling regcppush twice per invocation, once before executing the
callback, and once after. But not regcppop'ing twice. So each time we
would accumulate an extra "frame" of data. This is/was hidden somewhat by
the way we eventually "blow" the stack, so the extra data was just thrown
away at the end.
This removes the second set of pushes so that the save stack stays a stable
size as it unwinds from each failed eval.
We also weren't cleaning up after a (?{...}) when we failed to match to its
right. This unwinds the stack and restores the parens properly.
This adds tests to check how the save stack grows during patterns using
(?{ ... }) and (??{ ... }) and ensure that when we backtrack and re-execute
the EVAL again it cleans up the stack as it goes.
Commit: 527ac046fc9ff3b9e5013861a188a3f02789f825
https://github.com/Perl/perl5/commit/527ac046fc9ff3b9e5013861a188a3f02789f825
Author: Yves Orton <[email protected]>
Date: 2023-01-15 (Sun, 15 Jan 2023)
Changed paths:
M MANIFEST
M regexec.c
M t/re/pat_re_eval.t
M t/re/re_tests
M t/re/regexp.t
A t/re/regexp_normal.t
Log Message:
-----------
regexec.c - fix accept in CURLYX/WHILEM construct.
The ACCEPT logic didnt know how to handle WHILEM, which for
some reason does not have a next_off defined. I am not sure why.
This was revealed by forcing CURLYX optimisations off. This includes
a patch to test what happens if we embed an eval group in the tests
run by regexp.t when run via regexp_normal.t, which disabled CURLYX ->
CURLYN and CURLYM optimisations and revealed this issue.
This adds t/re/regexp_normal.t which test "normalized" forms of
the patterns in t/re/re_tests by munging them in various ways
to see if they still behave as expected. For instance injecting
a (?{}) can disable optimisations.
Commit: 89f66c3e4e28ec7e7413e9d3f2403fe6b119cb08
https://github.com/Perl/perl5/commit/89f66c3e4e28ec7e7413e9d3f2403fe6b119cb08
Author: Yves Orton <[email protected]>
Date: 2023-01-15 (Sun, 15 Jan 2023)
Changed paths:
M regcomp.c
M regcomp.h
M regcomp_internal.h
M t/re/pat.t
M t/re/reg_mesg.t
Log Message:
-----------
regcomp.c - increase size of CURLY nodes so the min/max is a I32
This allows us to resolve a test inconsistency between CURLYX and CURLY
and CURLYM. We use I32 because the existing count logic uses -1 and
this keeps everything unsigned compatible.
Commit: b23593312c27df95abaca4344124f6b10a5a5071
https://github.com/Perl/perl5/commit/b23593312c27df95abaca4344124f6b10a5a5071
Author: Yves Orton <[email protected]>
Date: 2023-01-15 (Sun, 15 Jan 2023)
Changed paths:
M pod/perldebguts.pod
M pp_ctl.c
M regcomp.c
M regcomp.h
M regcomp.sym
M regcomp_debug.c
M regexec.c
M regexp.h
M regnodes.h
M t/re/pat.t
M t/re/pat_rt_report.t
M t/re/re_tests
Log Message:
-----------
regcomp.c - Resolve issues clearing buffers in CURLYX (MAJOR-CHANGE)
CURLYX doesn't reset capture buffers properly. It is possible
for multiple buffers to be defined at once with values from
different iterations of the loop, which doesn't make sense really.
An example is this:
"foobarfoo"=~/((foo)|(bar))+/
after this matches $1 should equal $2 and $3 should be undefined,
or $1 should equal $3 and $2 should be undefined. Prior to this
patch this would not be the case.
The solution that this patches uses is to introduce a form of
"layered transactional storage" for paren data. The existing
pair of start/end data for capture data is extended with a
start_new/end_new pair. When the vast majority of our code wants
to check if a given capture buffer is defined they first check
"start_new/end_new", if either is -1 then they fall back to
whatever is in start/end.
When a capture buffer is CLOSEd the data is written into the
start_new/end_new pair instead of the start/end pair. When a CURLYX
loop is executing and has matched something (at least one "A" in
/A*B/ -- thus actually in WHILEM) it "commits" the start_new/end_new
data by writing it into start/end. When we begin a new iteration of
the loop we clear the start_new/end_new pairs that are contained by
the loop, by setting them to -1. If the loop fails then we roll back
as we used to. If the loop succeeds we continue. When we hit an END
block we commit everything.
Consider the example above. We start off with everything set to -1.
$1 = (-1,-1):(-1,-1)
$2 = (-1,-1):(-1,-1)
$3 = (-1,-1):(-1,-1)
In the first iteration we have matched "foo" and end up with this:
$1 = (-1,-1):( 0, 3)
$2 = (-1,-1):( 0, 3)
$3 = (-1,-1):(-1,-1)
We commit the results of $2 and $3, and then clear the new data in
the beginning of the next loop:
$1 = (-1,-1):( 0, 3)
$2 = ( 0, 3):(-1,-1)
$3 = (-1,-1):(-1,-1)
We then match "bar":
$1 = (-1,-1):( 0, 3)
$2 = ( 0, 3):(-1,-1)
$3 = (-1,-1):( 3, 7)
and then commit the result and clear the new data:
$1 = (-1,-1):( 0, 3)
$2 = (-1,-1):(-1,-1)
$3 = ( 3, 7):(-1,-1)
and then we match "foo" again:
$1 = (-1,-1):( 0, 3)
$2 = (-1,-1):( 7,10)
$3 = ( 3, 7):(-1,-1)
And we then commit. We do a regcppush here as normal.
$1 = (-1,-1):( 0, 3)
$2 = ( 7,10):( 7,10)
$3 = (-1,-1):(-1,-1)
We then clear it again, but since we don't match when we regcppop
we store the buffers back to the above layout. When we finally
hit the END buffer we also do a commit as well on all buffers, including
the 0th (for the full match).
Fixes GH Issue #18865, and adds tests for it and other things.
Commit: 244390e0bec74f97925c0e52e19e33457c6dcb30
https://github.com/Perl/perl5/commit/244390e0bec74f97925c0e52e19e33457c6dcb30
Author: Yves Orton <[email protected]>
Date: 2023-01-15 (Sun, 15 Jan 2023)
Changed paths:
M pod/perldebguts.pod
M regcomp.c
M regcomp.h
M regcomp.sym
M regcomp_debug.c
M regcomp_trie.c
M regexec.c
M regexp.h
M regnodes.h
M t/re/re_tests
Log Message:
-----------
regexec.c - teach BRANCH and BRANCHJ nodes to reset capture buffers
In /((a)(b)|(a))+/ we should not end up with $2 and $4 being set at
the same time. When a branch fails it should reset any capture buffers
that might be touched by its branch.
We change BRANCH and BRANCHJ to store the number of parens before the
branch, and the number of parens after the branch was completed. When
a BRANCH operation fails, we clear the buffers it contains before we
continue on.
It is a bit more complex than it should be because we have BRANCHJ
and BRANCH. (One of these days we should merge them together.)
This is also made somewhat more complex because TRIE nodes are actually
branches, and may need to track capture buffers also, at two levels.
The overall TRIE op, and for jump tries especially where we emulate
the behavior of branches. So we have to do the same clearing logic if
a trie branch fails as well.
Commit: 0774c54171ca84381df18b55f9a2ee6d94545a61
https://github.com/Perl/perl5/commit/0774c54171ca84381df18b55f9a2ee6d94545a61
Author: Yves Orton <[email protected]>
Date: 2023-01-15 (Sun, 15 Jan 2023)
Changed paths:
M pod/perldelta.pod
M pod/perlre.pod
M regcomp.c
M regcomp.h
M regcomp_debug.c
M regcomp_internal.h
M regcomp_study.c
M regexec.c
M regnodes.h
M t/re/pat_re_eval.t
M t/re/pat_rt_report.t
M toke.c
Log Message:
-----------
regcomp.c - add optimistic eval
This adds (*{ ... }) and (**{ ... }) as equivalents to
(?{ ... }) and (??{ ... }). The only difference being that
the star variants are "optimisitic" and are defined to never
disable optimisations. This is especially relevant now that
use of (?{ ... }) prevents important optimisations anywhere
in the pattern, instead of the older and inconsistent rules
where it only affected the parts that contained the EVAL.
It is also very useful for injecting debugging style expressions
to the pattern to understand what the regex engine is actually
doing. The older style (?{ ... }) variants would change the
regex engines behavior, meaning this was not as effective a
tool as it could have been.
Commit: 38fa0c3ede70e137bd4b97ee6065ba7f38f23a74
https://github.com/Perl/perl5/commit/38fa0c3ede70e137bd4b97ee6065ba7f38f23a74
Author: Yves Orton <[email protected]>
Date: 2023-01-15 (Sun, 15 Jan 2023)
Changed paths:
M pod/perldelta.pod
Log Message:
-----------
perldelta - add note about regex engine changes
capture buffer semantics should now be consistent.
Commit: eee7338e3f4910513282715d9541357adbcf1701
https://github.com/Perl/perl5/commit/eee7338e3f4910513282715d9541357adbcf1701
Author: Yves Orton <[email protected]>
Date: 2023-01-15 (Sun, 15 Jan 2023)
Changed paths:
M regexec.c
M regexp.h
M t/re/re_tests
Log Message:
-----------
regexec.c - incredibly inefficient solution to backref problem
Backrefs to unclosed parens inside of a quantified group were not being
properly handled, which revealed we are not unrolling the paren state properly
on failure and backtracking.
Much of the code assumes that when we execute a "conditional" operation (where
more than one thing could match) that we need not concern ourself with the
paren state unless the conditional operation itself represents a paren, and
that generally opcodes only needed to concern themselves with parens to their
right. When you exclude backrefs from the equation this is broadly reasonable
(i think), as on failure we typically dont care about the state of the paren
buffers. They either get reset as we find a new different accepting pathway,
or their state is irrelevant if the overal match is rejected (eg it fails).
However backreferences are different. Consider the following pattern
from the tests
"xa=xaaa" =~ /^(xa|=?\1a){2}\z/
in the first iteration through this the first branch matches, and in fact
because the \1 is in the second branch it can't match on the first iteration
at all. After this $1 = "xa". We then perform the second iteration. "xa" does
not match "=xaaa" so we fall to the second branch. The '=?' matches, but sets
up a backtracking action to not match if the rest of the pattern does not
match. \1 matches 'xa', and then the 'a' matches, leaving an unmatched 'a' in
the string, we exit the quantifier loop with $1 = "=xaa" and match \z against
the remaining "a" in the pattern, and fail.
Here is where things go wrong in the old code, we unwind to the outer loop,
but we do not unwind the paren state. We then unwind further into the 2nds
iteration of the loop, to the '=?' where we then try to match the tail with
the quantifier matching the empty string. We then match the old $1 (which was
not unwound) as "=xaa", and then the "a" matches, and we are the end of the
string and we have incorrectly accpeted this string as matching the pattern.
What should have happend was when the \1 was resolved the second time it
should have returned the same string as it did when the =? matched '=', which
then would have resulted in the tail matching again, and etc, eventually
unwinding the entire pattern when the second iteration failed entirely.
This patch is very crude. It simple pushes the state of the parens and creates
and unwind point for every case where we do a transition to a B or _next
operation, and we make the corresponding _next_fail do the appropriate
unwinding. The objective was to achieve correctness and then work towards
making it more efficient. We almost certainly overstore items on the stack.
In a future patch we can perhaps keep track of the unclosed parens before the
relevant operators and make sure that they are properly pushed and unwound at
the correct times.
Commit: 35efdf22155d1896202bb582455080e133feb470
https://github.com/Perl/perl5/commit/35efdf22155d1896202bb582455080e133feb470
Author: Yves Orton <[email protected]>
Date: 2023-01-15 (Sun, 15 Jan 2023)
Changed paths:
M regcomp.c
M regcomp.h
M regcomp.sym
M regcomp_internal.h
M regexec.c
M regexp.h
M regnodes.h
Log Message:
-----------
regexec.c - make REF into a backtracking state
This way we can do the required paren restoration only when it is in use. When
we match a REF type node which is potentially a reference to an unclosed paren
we push the match context information, currently for "everything", but in a
future patch we can teach it to be more efficient by adding a new parameter to
the REF regop to track which parens it should save.
This converts the backtracking changes from the previous commit, so that it is
run only when specifically enabled via the define RE_PESSIMISTIC_PARENS which
is by default 0. We don't make the new fields in the struct conditional as the
stack frames are large and our changes don't make any real difference and it
keeps things simpler to not have conditional members, especially since some of
the structures have to line up with each other.
If enabling RE_PESSIMISTIC_PARENS fixes a backtracking bug then it means
something is sensitive to us not necessarily restoring the parens properly on
failure. We make some assumptions that the paren state after a failing state
will be corrected by a future successful state, or that the state of the
parens is irrelevant as we will fail anyway. This can be made not true by
EVAL, backrefs, and potentially some other scenarios. Thus I have left this
inefficient logic in place but guarded by the flag.
Commit: f7f5b5acb14e668eed0ab7b6cb209069d7cf6299
https://github.com/Perl/perl5/commit/f7f5b5acb14e668eed0ab7b6cb209069d7cf6299
Author: Yves Orton <[email protected]>
Date: 2023-01-15 (Sun, 15 Jan 2023)
Changed paths:
M pod/perldebguts.pod
M regcomp.c
M regcomp.h
M regcomp.sym
M regcomp_debug.c
M regcomp_study.c
M regcomp_trie.c
M regexec.c
M reginline.h
M regnodes.h
Log Message:
-----------
regex engine - simplify regnode structures and make them consistent
This eliminates the regnode_2L data structure, and merges it with the older
regnode_2 data structure. At the same time it makes each "arg" property of the
various regnode types that have one be consistently structured as an anonymous
union like this:
union {
U32 arg1u;
I32 arg2i;
struct {
U16 arg1a;
U16 arg1b;
};
};
We then expose four macros for accessing each slot: ARG1u() ARG1i() and
ARG1a() and ARG1b(). Code then explicitly designates which they want. The old
logic used ARG() to access an U32 arg1, and ARG1() to access an I32 arg1,
which was confusing to say the least. The regnode_2L structure had a U32 arg1,
and I32 arg2, and the regnode_2 data strucutre had two I32 args. With the new
set of macros we use the regnode_2 for both, and use the appropriate macros to
show whether we want to signed or unsigned values.
This also renames the regnode_4 to regnode_3. The 3 stands for "three 32-bit
args". However as each slot can also store two U16s, a regnode_3 can hold up
to 6 U16s, or as 3 I32's, or a combination. For instance the CURLY style nodes
use regnode_3 to store 4 values, ARG1i() for min count, ARG2i() for max count
and ARG3a() and ARG3b() for parens before and inside the quantifier.
Compare: https://github.com/Perl/perl5/compare/b3a3473eb7fb...f7f5b5acb14e