In perl.git, the branch smoke-me/davem/regexec-refactor has been created
<http://perl5.git.perl.org/perl.git/commitdiff/056767f396f80f1b4623cf917ce8569ae43ceb4d?hp=0000000000000000000000000000000000000000>
at 056767f396f80f1b4623cf917ce8569ae43ceb4d (commit)
- Log -----------------------------------------------------------------
commit 056767f396f80f1b4623cf917ce8569ae43ceb4d
Author: David Mitchell <[email protected]>
Date: Sun Jul 21 11:57:22 2013 +0100
regexec(): add quick-fail test for anchored \G
under anchored \G, e.g. /ab\G/, we know that the start of the match must
be at (ganch-gofs); so fail quickly if that's off the beginning of the
string; or use it as the start point otherwise.
M regexec.c
commit f12bc95a04b9cd382bba56c591444375a577fc8f
Author: David Mitchell <[email protected]>
Date: Sun Jul 21 11:31:21 2013 +0100
regexec: swap ganch setting and gofs offsetting
These two blocks of code are currently independent of each other, but swap
them round so that the calculated ganch value will be available for
more clever gofs offset processing.
M regexec.c
commit fae3c0b5f2bc099bcd7ab8dc4d4b279c42efa67e
Author: David Mitchell <[email protected]>
Date: Sat Jul 20 16:16:10 2013 +0100
fix COW match capture optimisation
When an SV matches, a new SV is created which is a COW copy of the original
SV, and stored in prog->saved_copy, then prog->subbeg is set to point to
the (shared) PVX buffer.
Earlier in this branch I introduced an optimisation that skipped freeing
the old saved_copy and creating a new COW SV each time, if the old
saved_copy SV was already a shared copy of the SV being matched.
So far so good, except that I implemented it wrongly: if non-COW
matches (which malloc() subbeg) are interspersed with COW matches,
then the subbeg of the COW and the malloced subbeg get mixed up and
AddressSanitizer throws a wobbly.
The fix is simple: in the optimised branch, we still need to free subbeg
if RXp_MATCH_COPIED is true, then reassign it.
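
The fix described above can be illustrated with a toy model in C. This is a sketch only, not the code in regexec.c: the struct, the helper names and the decision to clear the flag after freeing are all assumptions made for illustration; only subbeg and RXp_MATCH_COPIED come from the message.

```c
#include <stdlib.h>
#include <string.h>

/* Toy model of the capture state; the field names echo the message,
 * but this is not the real regexp struct. */
struct capture {
    char *subbeg;   /* buffer that $& and friends would read from */
    int   copied;   /* stand-in for RXp_MATCH_COPIED: subbeg was malloc()ed */
};

/* Non-COW capture: take a private malloc()ed copy of the string. */
static int
capture_by_copy(struct capture *c, const char *str)
{
    size_t len = strlen(str) + 1;
    char *buf = malloc(len);
    if (!buf)
        return 0;
    memcpy(buf, str, len);
    if (c->copied)
        free(c->subbeg);
    c->subbeg = buf;
    c->copied = 1;
    return 1;
}

/* COW capture on the optimised path: the saved copy is reused, but a
 * previously malloc()ed subbeg must still be freed before subbeg is
 * pointed at the shared buffer. Clearing the flag is an assumption. */
static void
capture_by_cow(struct capture *c, char *shared_buf)
{
    if (c->copied) {
        free(c->subbeg);
        c->copied = 0;
    }
    c->subbeg = shared_buf;
}

/* Interleave the two kinds of capture; returns 1 if subbeg ends up
 * pointing at the shared buffer with nothing left to free. */
static int
demo_interleaved(void)
{
    static char shared[] = "Perl";
    struct capture c = { NULL, 0 };
    int ok;

    if (!capture_by_copy(&c, "first"))
        return 0;
    capture_by_cow(&c, shared);
    ok = (c.subbeg == shared && !c.copied);
    capture_by_cow(&c, shared);
    return ok && c.subbeg == shared && !c.copied;
}
```

Without the free in capture_by_cow(), the malloc()ed buffer from the first capture would leak, and a later free of subbeg would be applied to the shared buffer instead.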
M regexec.c
M t/re/pat.t
commit 2a9ab089ca15191429da11b8ae7de78c58f5afa7
Author: David Mitchell <[email protected]>
Date: Fri Jul 19 22:00:23 2013 +0100
regexec(): avoid uninit use of var
clang pointed out that
    if (...)
        goto phooey;
    oldsave = PL_savestack_ix;
    ...
  phooey:
    LEAVE_SCOPE(oldsave);
could use oldsave uninitialised. clang 1, dave 0.
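
The message does not say how the warning was silenced. One common way, sketched here in C with hypothetical stand-ins (scope_depth for PL_savestack_ix, leave_scope() for LEAVE_SCOPE), is to set the variable before the first jump to the label:

```c
/* Hypothetical stand-in for PL_savestack_ix. */
static int scope_depth = 0;

/* Hypothetical stand-in for LEAVE_SCOPE(): unwind to a saved depth. */
static void
leave_scope(int old)
{
    scope_depth = old;
}

static int
try_match(int fail_early)
{
    /* Recorded before any "goto phooey", so the label below never
     * sees an indeterminate value. */
    int oldsave = scope_depth;

    if (fail_early)
        goto phooey;

    scope_depth += 3;       /* pretend some state was saved */
    leave_scope(oldsave);
    return 1;

  phooey:
    leave_scope(oldsave);
    return 0;
}
```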
M regexec.c
commit e782b8f50a84e3eb6ced4c4eb02ca3751f231c64
Author: David Mitchell <[email protected]>
Date: Fri Jul 19 21:44:32 2013 +0100
fix /test_bootstrap.t under -DPERL_NO_COW
These tests check whether "require test.pl" inadvertently uses $& etc.
They do that by doing a simple pattern match "Perl" =~ /Perl/, then
checking that eval '$&' returns undef.
This has always been a dodgy thing to rely on. It turns out that under
5.18.0, whether eval '$&' was undef depended on whether the
intuit-only-match codepath was taken. So:
"Perl" =~ /Perl/; eval '$&'; # intuit-only match: returned undef
"Perl" =~ /\w+\w*/; eval '$&'; # regexec() match: returned 'Perl'.
In this branch, the same code path is now used for both intuit() and
regexec() matches, so both return 'Perl'.
So, abandon this approach to the test, and instead read in test.pl and
grep for the text '$&' etc.
Requires minor fixup to test.pl to avoid a false positive.
M t/porting/test_bootstrap.t
M t/test.pl
commit 5a28c11a342aa53c14a1b12b3cd4931e8cf46c8c
Author: David Mitchell <[email protected]>
Date: Fri Jul 19 20:07:56 2013 +0100
fix build under -DPERL_NO_COW
An earlier commit in this branch fixed up capturing on intuit-only
matches.
However, the new code grabbed the buffer before setting offs[0].start,
offs[0].end. Under old-style non-COW, it uses offs[0].start and end to
determine what subset of the buffer to capture. So set them first!
M regexec.c
commit 4bd0fb0581abfbc53279e3daebef4e06a8585ee9
Author: David Mitchell <[email protected]>
Date: Fri Jul 19 18:12:53 2013 +0100
regexec(): access extflags directly
Some bits of code that had been moved from pp_match() etc into
regexec() still used the external API to access flags, i.e.
RX_EXTFLAGS(rx)
Replace those uses with the more direct
prog->extflags
for consistency with the rest of the code.
M regexec.c
commit 3af02f3e90924c43eed4888a256652df698c7b14
Author: David Mitchell <[email protected]>
Date: Fri Jul 19 17:44:34 2013 +0100
regexec(): tidy up ganch-setting code
It's a bit verbose with tons of debugging statements. Hard to see the wood
for the trees.
M regexec.c
commit ba374dc607656aec2985a11143521611472b37b5
Author: David Mitchell <[email protected]>
Date: Fri Jul 19 17:29:01 2013 +0100
regexec(): merge the 2 RXf_GPOS_SEEN setup blocks
Change
    if (RXf_GPOS_SEEN) {
        ... adjust startpos ...
    }
    ...
    if (RXf_GPOS_SEEN) {
        ... calculate ganch ...
    }
to
    if (RXf_GPOS_SEEN) {
        ... adjust startpos ...
        ... calculate ganch ...
    }
    ...
Should contain no functional changes.
With this commit (building on many previous ones in this branch), all the
setup for \G is now in one place in regexec(), rather than scattered
across various places in pp_match(), regexec() etc.
M regexec.c
commit 44c503ac6ae5e482ae7c45562e18fccd4567aa48
Author: David Mitchell <[email protected]>
Date: Fri Jul 19 17:10:04 2013 +0100
regexec(): simplify RXf_ANCH_GPOS pos calc
There are two bits of code in regexec() that do special handling for
RXf_ANCH_GPOS:
First, after setting ganch from pos(), it does a couple of quick-fail
checks:
fail if s > ganch
fail if (ganch - gofs) < strbeg
at this point it also updates s to be ganch - gofs, although confusingly,
s is not subsequently used.
Second, when about to call regtry, it calculates a new start value
(ignoring the old one, s):
tmp_s = ganch - gofs;
then checks:
fail if tmp_s < strbeg
As can be seen, these two sets of tests essentially partially duplicate
each other.
This commit moves all the work to the second block of code, which
simplifies things, and makes the first block of code purely about
calculating ganch.
Note that the new condition added by this commit in the second block,
fail if s > tmp_s (i.e. if s > (ganch - gofs))
subsumes both previous conditions, since
a) it is stronger than s > ganch
b) s will be >= strbeg, so tmp_s >= strbeg
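
The single combined test can be sketched in C as follows. This is an illustration rather than the actual regexec.c code: the function name is hypothetical, and positions are modelled as offsets from strbeg (so strbeg is 0 and s >= 0) instead of pointers.

```c
#include <stddef.h>

/* Hypothetical helper: given the current position s, the \G anchor
 * ganch and the fixed offset gofs (all as offsets from strbeg),
 * return the one position worth trying, or -1 for a quick fail. */
static ptrdiff_t
anchored_gpos_start(ptrdiff_t s, ptrdiff_t ganch, ptrdiff_t gofs)
{
    ptrdiff_t tmp_s = ganch - gofs;

    /* One test covers both old checks: since gofs >= 0 it is at
     * least as strict as s > ganch, and since s >= 0 it also
     * rejects a tmp_s that lies before the start of the string. */
    if (s > tmp_s)
        return -1;
    return tmp_s;
}
```

For example, with s at 0, ganch at 1 and gofs of 2, tmp_s is -1, so the call fails quickly without a separate comparison against strbeg.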
M regexec.c
commit 9f4f36756ae9b03535af6c67486e83c706afb4c1
Author: David Mitchell <[email protected]>
Date: Fri Jul 19 16:37:04 2013 +0100
regexec(): use regtry(&s) not regtry(&startpos)
regexec() has several cases such as anchored, floating etc, and for each
of these it will call regtry() one or more times to match at likely
starting positions. In all cases but one, it calls regtry(&s), where
s is the current start position that moves as we try new positions. In
the one final case it uses startpos instead, which is supposed to be static.
The actual code looks like:
if (s == startpos && regtry(&startpos))
which might seem harmless, but under things like (*COMMIT), regtry can
update the pointer value (which is why its address is taken). So in this
(obscure) case, the wrong pointer gets updated.
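
A small C sketch of why the argument matters. Everything here is a hypothetical stand-in: fake_regtry() simply moves the pointer it is given and reports failure, to mimic a regtry() call that updates its argument.

```c
#include <stddef.h>

static const char subject[] = "abc";

/* Hypothetical stand-in for regtry(): modifies the caller's pointer
 * through its address, then reports that the match failed. */
static int
fake_regtry(const char **startposp)
{
    *startposp += 1;
    return 0;
}

/* Returns the value of startpos after one match attempt, so the
 * caller can see whether the supposedly static pointer was changed. */
static const char *
startpos_after_attempt(const char *startpos, int use_buggy_form)
{
    const char *s = startpos;

    if (use_buggy_form)
        (void)fake_regtry(&startpos);   /* clobbers startpos */
    else
        (void)fake_regtry(&s);          /* only the moving pointer changes */
    return startpos;
}
```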
M regexec.c
M t/re/pat_advanced.t
commit 5995003ec79f2b0fb6bddba3c6f929c42f80691e
Author: Father Chrysostomos <[email protected]>
Date: Mon Jul 15 18:57:01 2013 -0700
[perl #77814] Make defelems propagate pos
[This commit was earlier reverted to make rebasing easier; it's now
added back in its original form, except that the changes to pp_hot.c have
been reapplied by hand to account for the significant changes that have
taken place in pp_match() since then.]
When elements of @_ refer to nonexistent hash or array elements, then
the magic scalar in $_[0] delegates all set/get actions to the element
it represents, vivifying it if needed.
pos($_[0]), however, was not delegating the value to the element, but
storing it on the magical “deferred element” scalar.
M embed.fnc
M embed.h
M mg.c
M pp.c
M pp_ctl.c
M pp_hot.c
M proto.h
M regexec.c
M sv.c
M t/op/pos.t
commit b1ecde2f26f0c5b9ab7b30d078d38a5d6b2d14a7
Author: David Mitchell <[email protected]>
Date: Tue Jul 16 16:31:04 2013 +0100
s/.(?=.\G)/X/g: refuse to go backwards
On something like:
$_ = "123456789";
pos = 6;
s/.(?=.\G)/X/g;
each iteration could in theory start with pos one character to the left
of the previous position, and with the substitution replacing bits that
it has already replaced. Since that way madness lies, ban any attempt by
s/// to substitute to the left of a previous position.
To implement this, add a new flag to regexec(), REXEC_FAIL_ON_UNDERFLOW.
This tells regexec() to return failure even if the match itself succeeded,
but where the start of $& is before the passed stringarg point.
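
The effect of the new flag can be sketched in C as below. The flag value and the function are hypothetical; the real flag is REXEC_FAIL_ON_UNDERFLOW in regexp.h, and positions are modelled here as plain offsets.

```c
/* Hypothetical flag value, for illustration only. */
#define SKETCH_FAIL_ON_UNDERFLOW 0x1u

/* Decide what the caller is told: a match that succeeded but whose
 * start lies before stringarg is reported as a failure when the
 * flag is set. */
static int
reported_result(int matched, long match_start, long stringarg,
                unsigned flags)
{
    if (!matched)
        return 0;
    if ((flags & SKETCH_FAIL_ON_UNDERFLOW) && match_start < stringarg)
        return 0;
    return 1;
}
```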
This change caused one existing test to fail (which was added about a year
ago):
$_="abcdef";
s/bc|(.)\G(.)/$1 ? "[$1-$2]" : "XX"/ge;
print; # used to print "aXX[c-d][d-e][e-f]"; now prints "aXXdef"
I think that that test relies on ambiguous behaviour, and that my change
makes things saner.
Note that s/// with \G is generally very under-tested.
M pod/perlre.pod
M pp_ctl.c
M pp_hot.c
M regexec.c
M regexp.h
M t/re/subst.t
commit 0f55c4ebf6b57d27a0e445024174c647313a8844
Author: David Mitchell <[email protected]>
Date: Mon Jul 15 21:57:34 2013 +0100
pp_subst: don't use REXEC_COPY_STR on 2nd match
pp_subst() sets the REXEC_COPY_STR flag on the first match. On the second
and subsequent matches, it doesn't set it in two out of three of the branches
(including pp_substcont) where it calls CALLREGEXEC().
The one place where it *does* set it is a (harmless) mistake, since regexec
ignores REXEC_COPY_STR if REXEC_NOT_FIRST is set (which it is, on all 3
branches).
So unset REXEC_COPY_STR in the third branch too, for consistency.
M pp_hot.c
commit d9b16f176fb06efca47808db3c55621952f37252
Author: David Mitchell <[email protected]>
Date: Mon Jul 15 21:24:02 2013 +0100
pp_subst: combine 3 small elsif blocks into 1
and slightly reduce the scope of the temporary i var.
M pp_hot.c
commit 3fa4b5338a2c73bd462c7fbb8766027ccd749cfb
Author: David Mitchell <[email protected]>
Date: Mon Jul 15 21:10:47 2013 +0100
pp_subst: remove one use of 'm' local var
M pp_hot.c
commit 0811b84722a3f0f0f6bf88f98b10d9cdb8439f08
Author: David Mitchell <[email protected]>
Date: Mon Jul 15 21:00:49 2013 +0100
pp_subst: reduce scope of 'i' variable
it's just used as a temporary var in a few blocks; declare it individually
in each block rather than being scoped to the whole function.
M pp_hot.c
commit 26d3a05f731ae584823b58ab466e4e214b25871c
Author: David Mitchell <[email protected]>
Date: Mon Jul 15 20:37:44 2013 +0100
pp_subst: reduce scope of 'm' var
it's mainly just a temporary local var; declare it individually within each
scope that makes use of it.
M pp_hot.c
commit 36cb855d42271561da9eb89820a1e607ebb0c154
Author: David Mitchell <[email protected]>
Date: Mon Jul 15 20:17:51 2013 +0100
pp_subst: set/use s,m vars near where they're used
This should be just a cosmetic change; but basically change stuff like
m = orig;
s = foo();
... lots of lines not using s or m ...
bar(m,s)
... more stuff using s ...
to
... lots of lines not using s or m ...
s = foo();
bar(orig,s)
... more stuff using s ...
This is part of a few commits to generally clean up the scope and
comprehensibility of the vars within pp_subst
M pp_hot.c
commit 9847d54a2eae21197f060e1e92e94d86cb5c96bf
Author: David Mitchell <[email protected]>
Date: Mon Jul 15 19:54:53 2013 +0100
pp_subst: reduce scope of 'd' variable
It's just used as a temporary value in two branches;
so make it a local var in each of those branches.
M pp_hot.c
commit 84b08f6d0af0027a3e5df71b970f53f0d5a4e266
Author: David Mitchell <[email protected]>
Date: Mon Jul 15 19:16:10 2013 +0100
pp_subst: cosmetic re-arrangement of vars
since 'orig' always points to the start of the string, while 's' varies,
change
s = SvPV_nomg(...);
...other stuff using value of s ...
orig = s
...
to
orig = SvPV_nomg(...);
...other stuff using value of orig ...
s = orig
...
No functional change, just reduces the cognitive load slightly
also adds some comments as to what force_on_match is about.
M pp_hot.c
commit 5411d224edceeccf125f41c24842ab35a6741059
Author: David Mitchell <[email protected]>
Date: Sat Jul 13 21:18:50 2013 +0100
regexec(): fix ganch and till settings
Since startpos is now the \G-adjusted start position, use the real start
position instead (stringarg) when setting reginfo->till, and when setting
ganch in the non-pos case.
This stops this infinitely looping:
$_ = "x"; pos = 1; @a = /x\G/g
M regexec.c
M t/re/pat.t
commit 5614d02439812f968541908f18925cdb4c01b114
Author: David Mitchell <[email protected]>
Date: Sat Jul 13 20:16:19 2013 +0100
regexec(): skip second intuit() call
A few commits ago, the call to intuit() done by the *callers* of
regexec() was moved into regexec() itself. Since regexec() could also call
intuit(), this temporarily led to the situation where intuit() was
harmlessly but inefficiently called twice. The last few commits have
removed the subtle differences between the conditions for each of the two
call points, so the second call to intuit() can now be removed.
A consequence of this is that we have to adjust the usage of the
'startpos' versus 's' variables; the original intent was that
startpos was constant, while s moved forward in the string after intuit
etc. This got a bit lost during the recent reorganisation, but is now
re-established. (startpos isn't quite constant: it will contain any
initial adjustment for \G.)
M regexec.c
commit 4364281aba77d2514fbbf9d3f4075b761c1a03fc
Author: David Mitchell <[email protected]>
Date: Fri Jul 19 02:08:56 2013 +0100
fix intuit_start() with \G
Intuit assumed that any anchor, including \G, anchored at BOS or after \n.
This obviously isn't the case for \G, so exclude RXf_ANCH_GPOS from the
RXf_ANCH branch.
This has never been spotted before, since intuit used to be skipped when
\G was present.
M regexec.c
M t/re/pat.t
commit 43899d2fe37432a96491c1abc97bfe70beeac1a8
Author: David Mitchell <[email protected]>
Date: Sat Jul 13 15:23:59 2013 +0100
enable intuit under anchored \G, and fix a bug
Since 1999, regcomp has had approximately the following comment and code:
/* XXXX Currently intuiting is not compatible with ANCH_GPOS.
This should be changed ASAP! */
if ((r->check_substr || r->check_utf8) && !(r->extflags &
RXf_ANCH_GPOS)) {
r->extflags |= RXf_USE_INTUIT;
....
However, it appears that since that time, intuit has had (at least some)
support for anchored \G added.
Note also that the RXf_USE_INTUIT flag (up until a few commits ago)
was only used by *callers* of regexec() to decide whether to call intuit()
first; regexec() itself also internally calls intuit() on occasion, and in
those cases it directly checks just the check_substr and check_utf8 fields,
rather than the RXf_USE_INTUIT flag; so in those cases it's using intuit
even in the presence of anchored \G.
So, in the grand perl tradition of "make the change and see if anything
in the test suite breaks", that's what I've done for this commit
(i.e. removed the RXf_ANCH_GPOS check above).
So intuit is now normally called even in the presence of anchored \G.
This means that something like "aaaa" =~ /\G.*xx/ will now quickly fail in
intuit rather than more slowly failing in regmatch().
Note that I have no actual knowledge of whether intuit is *really*
anchored-\G-safe.
As it happens one thing in the test suite did break, and this was due to
the following code, added back in 1997:
if (
....
&& !((RExC_seen & REG_SEEN_GPOS) || (r->extflags & RXf_ANCH_GPOS)))
)
r->extflags |= RXf_CHECK_ALL;
It was clearly meant to say that if either of those \G flags were present,
don't set the RXf_CHECK_ALL flag (which enables intuit-only matches).
But the '!' was set to cover the first condition only, rather than both.
Presumably this had never been spotted before due to skipping intuit under
anchored \G.
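
The misplaced '!' can be illustrated with a short C sketch. The flag values and function names are hypothetical, and the rest of the original condition is elided in the message, so only the two \G tests are modelled.

```c
/* Hypothetical flag values standing in for REG_SEEN_GPOS and
 * RXf_ANCH_GPOS. */
#define SKETCH_SEEN_GPOS 0x1u
#define SKETCH_ANCH_GPOS 0x2u

/* The '!' applies to the first test only, so an anchored \G on its
 * own still lets the flag be set. */
static int
may_check_all_buggy(unsigned seen, unsigned extflags)
{
    return !(seen & SKETCH_SEEN_GPOS)
        || (extflags & SKETCH_ANCH_GPOS) != 0;
}

/* The '!' covers both tests: either kind of \G blocks the flag. */
static int
may_check_all_intended(unsigned seen, unsigned extflags)
{
    return !((seen & SKETCH_SEEN_GPOS) || (extflags & SKETCH_ANCH_GPOS));
}
```

The two forms disagree exactly when the anchored-\G flag is set, which fits the observation that the problem stayed hidden while intuit was skipped under anchored \G.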
[Actually this commit broke some other stuff too, not covered by the test
suite. See the next commit. Hooray for git rebase -i and history
re-writing!]
M regcomp.c
commit 2d2087d61956b2cea17bdfbea140555b4b359b60
Author: David Mitchell <[email protected]>
Date: Wed Jul 10 20:00:22 2013 +0100
regexec_flags(): remove vestigial scream support
intuit has an arg (data) that used to be used for scream stuff, but which
is now unused. However, Perl_regexec_flags() still went to the trouble of
setting up that parameter when calling intuit. So stop doing that.
M regexec.c
commit 0cf6c0c74c5601309d4423c42ee0a5dd2e8aa8ca
Author: David Mitchell <[email protected]>
Date: Wed Jul 10 14:28:02 2013 +0100
regexec_flags(): keep stringarg constant
stringarg is the arg passed to Perl_regexec_flags() to indicate where to
start matching. Currently the code adjusts this under \G, then copies it
to startpos, then later tinkers with startpos further.
Change it so that stringarg is never changed, and all the adjusting is to
startpos. Shouldn't make any logical difference, but makes the code
slightly cleaner and easier to understand (and doesn't require minend to
be adjusted any more).
M regexec.c
commit 7c57745e307dd7c7b55445ca55b8afd8663b53d3
Author: David Mitchell <[email protected]>
Date: Wed Jul 10 13:35:51 2013 +0100
regexec_flags(): use result of intuit_start()
When I moved the call to re_intuit_start() into Perl_regexec_flags()
a few commits earlier, I assigned the return value to the wrong variable,
so a subsequent match would still start at the beginning, not at the
intuited start point. This commit corrects that, by updating startpos
rather than stringarg.
M regexec.c
commit 09b5e72a7d05e0ef3d05564c63558d1652c1a4a3
Author: David Mitchell <[email protected]>
Date: Wed Jul 10 11:13:38 2013 +0100
pp_match: simplify pos()-getting code
The previous commit removed the \G handling from pp_match; most of what's
left in that code block is redundant code that just sets curpos under all
conditions. So tidy it up.
M pp_hot.c
commit fb183b313a09dfe5a58ea247c3d6541851044132
Author: David Mitchell <[email protected]>
Date: Sun Jun 23 13:30:59 2013 +0100
regexec: handle \G ourself, rather than in callers
Normally a /g match starts its processing at the previous pos() (or at
char 0 if pos is not set); however in the case of something like /abc\G/
we actually need to start 3 characters before pos. This has been handled
by the *callers* of regexec() subtracting prog->gofs from the stringarg
arg before calling it, or by setting stringarg to strbeg for floating,
such as /\w+\G/.
This is clearly wrong: the callers of regexec() shouldn't need to worry
about the details of getting \G right: move this code into regexec()
itself.
(Note that although this commit passes all tests, it quite possibly isn't
logically correct. It will get fixed up further during the next few
commits)
M pp_ctl.c
M pp_hot.c
M regexec.c
M regexp.h
commit 8e35c21509360098e30b33909ce0d87c80eab5e8
Author: Yves Orton <[email protected]>
Date: Sun Sep 16 14:25:02 2012 +0200
fix 114884 positive GPOS lookbehind regex substitution failure
This also corrects a test added in 2c2969659ae1c534e7f3fac9e7a7d186defd9943
which was arguably wrong. The details of \G are a bit fuzzy, and IMO it's a
little hard to say exactly what is right, although it generally is clear
what is wrong.
M pp_ctl.c
M t/re/subst.t
commit 6d8b498b23a62b2005c80611b2dac9d64e92a0d8
Author: David Mitchell <[email protected]>
Date: Sat Jun 22 17:24:13 2013 +0100
pp_match(): don't set REXEC_IGNOREPOS on 1st iter
Currently all core callers of regexec set both the
REXEC_IGNOREPOS and REXEC_NOT_FIRST flags, or neither, depending
on whether this is the first or subsequent iteration of a //g;
*except* for one place in pp_match(), where REXEC_IGNOREPOS is set
on the first iteration for the one specific case of /g with an anchored
\G.
Now AFAICT this makes no difference, because the starting position
as calculated by regexec() still comes to the same value of
(strbeg + pos - gofs), and the same value of ganch is calculated.
Also in the commit that added this particular use of the flag to pp_match,
(0ef3e39ecdfec), removing the flag makes no difference to the passing or
not of the new test case.
So I don't understand what its purpose is, and it's possibly a mistake.
Removing it now makes the code simpler for further cleanup.
M pp_hot.c
commit 29e09a4f263f00c00db16ea75fd318556c787005
Author: David Mitchell <[email protected]>
Date: Fri Jun 21 21:44:45 2013 +0100
pp_match(): stop setting $-[0] before regexec()
It doesn't actually achieve anything.
M pp_hot.c
commit 21acf47063e20a4812c5b316937f4f1503e00c32
Author: David Mitchell <[email protected]>
Date: Fri Jun 21 20:16:30 2013 +0100
pp_match: avoid setting $+[0]
This function sometimes set $+[0] to pos() before calling regexec().
This value isn't used by regexec(), and was really just a way of updating
the new start position for //g. Replace it with a local var instead.
M pp_hot.c
commit 8eb3a2d36f3baf76de4648dc5414d73b96f711ad
Author: David Mitchell <[email protected]>
Date: Fri Jun 21 20:00:01 2013 +0100
pp_match(): eliminate unused t variable
and restrict usage of s variable
M pp_hot.c
commit 4f9ff9621ec88e45707a55721984745d5b467a5a
Author: David Mitchell <[email protected]>
Date: Thu Jun 20 14:54:44 2013 +0100
pp_match(): skip passing gpos arg to regexec()
In one specific case, pp_match() passes the value of pos() to regexec()
via the otherwise unused 'data' arg.
It turns out that pp_match() only passes this value when it exists and is
>= 0, while regexec() only uses it when there's no pos magic or pos() < 0.
So it's never used as far as I can tell.
So, strip it for now.
M pp_hot.c
M regexec.c
commit 434ebeb7a0b82fe65f898038e6415a775f60f4a0
Author: David Mitchell <[email protected]>
Date: Thu Jun 20 14:22:42 2013 +0100
add some basic floating /\G/ tests
Floating is when the \G is an unknown number of characters from the start
of the pattern, such as /a+\G/. Surprisingly, there were no tests for this
form.
Here are a few basic tests just to exercise the main code paths. More
comprehensive tests could do with being added at some point.
M t/re/pat.t
commit 420f6ff065f94365006be245972d203b614e9000
Author: David Mitchell <[email protected]>
Date: Thu Jun 20 13:33:31 2013 +0100
fix /.\G/ under threading
When a regex was being duped, its (constant) gofs field wasn't being
copied, but rather was being set to zero. Skip this and lots of TODO tests
pass.
M regcomp.c
M t/re/pat.t
commit 620b9a9e5d8e36a1d5c246513bdfbfce4f6187aa
Author: David Mitchell <[email protected]>
Date: Wed Jun 19 12:44:41 2013 +0100
skip creating new capture COW SV if possible
Each time we do a match, we currently (where possible) make a COW copy of
the just-matched string. This involves creating a new SV that shares the
same PVX buffer with the string. In a repeated match like while (/.../g),
that means that each time round we free the old capture SV and create a new
one.
As an optimisation, skip the free/create if the old capture SV is already
a COW clone of the match string.
M regexec.c
commit 78ddd2085524d3bb7b5a7f9abe87d4d0ffa1f0cb
Author: David Mitchell <[email protected]>
Date: Tue Jun 18 16:34:43 2013 +0100
make Perl_reg_set_capture_string static
This function was introduced a few commits ago. Since it's now only
called from within regexec.c, make it static.
M embed.fnc
M embed.h
M proto.h
M regexec.c
commit af31338cc0a47447e470533f54781cfd4880c202
Author: David Mitchell <[email protected]>
Date: Tue Jun 18 16:17:39 2013 +0100
add intuit-only match to s///
pp_match() has an intuit-only match mode: if intuit_start() succeeds and
the regex is marked as only needing intuit (RXf_CHECK_ALL), then calling
regexec() is skipped, $& is simply set, and pp_match() returns.
The commit which originally added that feature to pp_match() also added a
comment to pp_subst() suggesting that the same thing could be done there.
This commit finally achieves that. It builds on the previous commit (which
moved this mechanism from pp_match() directly into regexec()), skipping
calling intuit_start() and directly calling regexec() with the
REXEC_CHECKED flag not set.
This appears to reduce the execution time of a simple substitution
like s/abc/def/ by a fifth.
M pp_hot.c
commit 27e84fb23f529f8bc4407f2d054b2c0fd3aec1b9
Author: David Mitchell <[email protected]>
Date: Tue Jun 18 14:44:12 2013 +0100
move intuit call from pp_match() into regexec()
Currently the main part of pp_match() looks like:
    if (can_use_intuit) {
        if (!intuit_start())
            goto nope;
        if (can_match_based_only_on_intuit_result) {
            ... set up $&, $-[0] etc ...
            goto gotcha;
        }
    }
    if (!regexec(..., REXEC_CHECKED|r_flags))
        goto nope;
  gotcha:
    ...
This rather breaks the regex API encapsulation. The caller of the regex
engine shouldn't have to worry about whether to call intuit() or
regexec(), and to know to set $& in the intuit-only case.
So, move all the intuit-calling and $& setting into regexec itself.
This is cleaner, and will also shortly allow us to enable intuit-only
matches in pp_subst() too. After this change, the code above looks like
(in its entirety):
if (!regexec(..., r_flags))
goto nope;
...
There, isn't that nicer?
M pp_hot.c
M regexec.c
commit a11044fc16571a1d523f7907e46c98b72e1ed86d
Author: David Mitchell <[email protected]>
Date: Tue Jun 18 12:29:16 2013 +0100
make intuit_start() handle mixed utf8-ness
Fix a bug in intuit_start() that makes it fail when the utf8-ness of the
string and pattern differ. This was mostly masked, since pp_match() skips
calling intuit in this case (and has done since 2000, presumably as a
workaround for this issue, and possibly for other issues since fixed).
But pp_subst() didn't skip, so code like this would fail:
$c = "\x{c0}";
utf8::upgrade($c);
print "ok\n" if $c =~ s/\xC0{1,2}$/\xC0/i;
Now that intuit is (hopefully) fixed, also remove the guard in pp_match().
M pp_hot.c
M regexec.c
commit b92075f8fb316b625ecd24ab8d902cae1fb17411
Author: David Mitchell <[email protected]>
Date: Mon Jun 17 17:38:41 2013 +0100
pp_match(): fix UTF* match setting
A recent commit did RX_MATCH_UTF8_set() based on the utf8-ness of the
pattern rather than the match string. It didn't matter because in that
branch they were guaranteed to have the same value, but fix it anyway,
both for correctness' sake, and because it *will* matter shortly.
M pp_hot.c
commit e1858333794959974e53fa9d07cd67276124cd09
Author: David Mitchell <[email protected]>
Date: Sun Jun 16 16:54:09 2013 +0100
pp_match(): intuit can handle refs these days
It looks like we no longer need to skip intuit-only matching when the
match is a ref or overloaded (e.g. $ref =~ /ARRAY/)
M pp_hot.c
commit 59488c4da9eaef3830ba7a4de1c295a49d06b0f3
Author: David Mitchell <[email protected]>
Date: Sun Jun 16 16:09:07 2013 +0100
pp_match(): remove ret_no label
The nope: and ret_no: labels labelled the same point in the code.
Eliminate one of them.
M pp_hot.c
commit 6cff35687de688126666ebd66c5cbfee00a94a78
Author: David Mitchell <[email protected]>
Date: Sun Jun 16 16:01:22 2013 +0100
pp_match(): combine intuit and regexec branches
There was some code that looked roughly like:
    if (can_match_on_intuit_only) {
        ....
        goto yup;
    }
    if (!regexec())
        goto ret_no;
  gotcha:
    A; B;
    if (simple)
        RETURNYES;
    X; Y;
    RETURN;
  yup:
    A;
    if (!simple)
        goto gotcha;
    B;
    RETURNYES
Refactor it to look like
    if (can_match_on_intuit_only) {
        ....
        goto gotcha;
    }
    if (!regexec())
        goto ret_no;
  gotcha:
    A; B;
    if (simple)
        RETURNYES;
    X; Y;
    RETURN;
As well as simplifying the code, it also avoids duplicating some work
(the 'A' above was done twice sometimes) - harmless but less efficient.
M pp_hot.c
commit fdbfd936e8402d5e1da2a1aaad0fc191b82b697d
Author: David Mitchell <[email protected]>
Date: Sun Jun 16 15:45:20 2013 +0100
pp_match(): refactor intuit-only code
change
    if (intuit_only)
        goto yup;
    ...
  yup:
    A; B; X; Y;
to
    if (intuit_only)
        A; B;
        goto yup;
    ...
  yup:
    X; Y;
where A and B are intuit_only-specific steps while X and Y are done by the
regexec() branch too. This will shortly allow us to merge the two
branches.
M pp_hot.c
commit bdbc9fd600c0824feedace694fd28bee093a2757
Author: David Mitchell <[email protected]>
Date: Sun Jun 16 15:38:56 2013 +0100
pp_match(): minor refactor: consolidate RETPUSHYES
Make the code slightly simpler by doing an early RETPUSHYES after success
where possible.
M pp_hot.c
commit bf86bda065594a1593cc4009d4eadecd49181021
Author: David Mitchell <[email protected]>
Date: Sun Jun 16 14:27:19 2013 +0100
pp_match(): factor out some common code
Some identical code is used in two separate branches to set pos()
after a successful match. Hoist the common code to above the branch.
M pp_hot.c
commit da2e45b5582b98dc718e67dd48f1f026e5b078c6
Author: David Mitchell <[email protected]>
Date: Sun Jun 16 13:26:30 2013 +0100
re-enable intuit-only matches
The COW changes inadvertently disabled intuit-only matches.
These are where calling intuit_start() to find the starting point for a
match is enough to know that the whole pattern will match, and so you can
skip calling regexec() too. For example, fixed strings without captures
such as /abc/.
The COW breakage meant that regexec was always called, making something
like /abc/ about 3 times slower.
This commit re-enables intuit-only matches.
However, it turns out that this opens up a can of worms.
Normally, recording the just-matched-against string so that things like $&
and captures work, is done within regexec(). When this is skipped,
pp_match has to do a similar thing itself. The code that does this (which
is in principle a copy of the code in regexec()) is a bit of a mess. Due
to a logic error, a big chunk of it has actually been dead code for 10+
years. It's had lots of modifications (e.g. people have made the same
changes to regexec() and pp_match()), but since it never gets executed,
errors aren't detected. And the bits that are executed haven't completely
received all the COW and SAWAMPERSAND updates that have happened recently.
The best way to fix this is to extract out the capture code in
regexec() into a separate function (which we did in the previous commit),
then throw away all the broken capture code in pp_match() and replace it
with a call to the new function (which this commit does).
One side effect of this commit is that as well as restoring intuit-only
behaviour for the patterns that used to have it pre-COW, it also enables this
behaviour for patterns which formerly didn't, namely where $& or //p are
seen.
This commit is the barest minimum necessary to fix this; subsequent
commits will clean and improve this.
M pp_hot.c
commit 5d0cb02a96baee84f84d8f4e97b7e43198e95aa3
Author: David Mitchell <[email protected]>
Date: Sat Jun 15 17:54:10 2013 +0100
add Perl_reg_set_capture_string() function
Cut and paste into a separate function, the block of code in
regexec_flags() that is responsible (on successful match) for setting
RX_SAVED_COPY, RX_SUBBEG etc, ready for use by capture vars like $1, $&.
Although this function is currently only called from one place, we will
shortly use it elsewhere too.
This should contain no functional changes.
M embed.fnc
M embed.h
M proto.h
M regexec.c
commit eb6a0fdf794a6c95638135b4ce0c13ab19d0cfb9
Author: David Mitchell <[email protected]>
Date: Tue Jul 16 17:14:57 2013 +0100
Revert "[perl #77814] Make defelems propagate pos"
Temporarily revert this commit within this branch, to make rebasing
easier. The contents of this commit will be re-applied later.
This reverts commit 96c2a8ff507ccc5e4a6d00051b23e7a73d844322.
M embed.fnc
M embed.h
M mg.c
M pp.c
M pp_ctl.c
M pp_hot.c
M proto.h
M regexec.c
M sv.c
M t/op/pos.t
-----------------------------------------------------------------------
--
Perl5 Master Repository