In perl.git, the branch smoke-me/davem-regex-buffer-copy has been created
<http://perl5.git.perl.org/perl.git/commitdiff/2bcd9b62481f3d1cdcd60ef3e7b263485761ddb3?hp=0000000000000000000000000000000000000000>
at 2bcd9b62481f3d1cdcd60ef3e7b263485761ddb3 (commit)
- Log -----------------------------------------------------------------
commit 2bcd9b62481f3d1cdcd60ef3e7b263485761ddb3
Author: David Mitchell <[email protected]>
Date: Fri Sep 7 13:32:11 2012 +0100
fix a bug in handling $+[0] and unicode
The code to decide what substring of a pattern target to copy for the
sake of $1, $& etc, would, in the absence of $&, only copy the minimum
range needed to cover $1,$2,...., which might be a shorter range than
what $& covers. This is fine most of the time, but, when calculating
$+[0] on a unicode string, it needs a copy of the whole part of the string
covered by $&, since it needs to convert the byte offset into a char
offset.
So to fix this, always copy, as a minimum, the $& range.
I suppose we could be more clever about this: detect the presence
of @+ in the code, only do it for UTF8 etc; but this is simple
and non-fragile.
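A minimal sketch of why the conversion needs the whole $& range, assuming
well-formed UTF-8. The struct and function names here are hypothetical and
are not taken from regexec.c; only the field names suboffset, subcoffset
and sublen echo the regex fields described later in this log.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters in the UTF-8 byte range [s, s + nbytes).
 * A byte starts a character unless it is a continuation byte
 * (bit pattern 10xxxxxx). Assumes well-formed UTF-8. */
static size_t utf8_chars_in(const unsigned char *s, size_t nbytes)
{
    size_t chars = 0;
    for (size_t i = 0; i < nbytes; i++) {
        if ((s[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}

/* Hypothetical model of the saved copy: buf holds sublen bytes of the
 * target string, starting at byte offset suboffset, which is character
 * offset subcoffset. */
struct saved_copy {
    const unsigned char *buf;
    size_t sublen;
    size_t suboffset;
    size_t subcoffset;
};

/* Convert a byte offset in the original string into a char offset.
 * Every byte between suboffset and byte_off has to be in the buffer,
 * which is why a copy covering only $1 can be too short for $+[0]. */
static size_t byte_to_char_offset(const struct saved_copy *c, size_t byte_off)
{
    assert(byte_off >= c->suboffset);
    assert(byte_off - c->suboffset <= c->sublen);
    return c->subcoffset + utf8_chars_in(c->buf, byte_off - c->suboffset);
}
```

With a copy that stops at the end of $1, the second assertion fails for the
byte offset of the end of $&; copying the full $& range keeps it in bounds.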
M regexec.c
M t/re/re_tests
commit 500949cafaf334272b21a89bf044bc3bcf0cd4b3
Author: David Mitchell <[email protected]>
Date: Sat Sep 1 11:43:53 2012 +0100
m// and s///; don't copy TEMP/AMAGIC strings
Currently pp_match and pp_subst make a copy of the match string if it's
SvTEMP(), and in the case of pp_match, also if it's SvAMAGIC().
This is no longer necessary, as the code will always copy the string
anyway if it's actually needed after the match, i.e. if it detects the
presence of $1, $& or //p etc. Until a few commits ago, this wasn't the
case for pp_match: it would sometimes skip copying even in the presence of
$1 et al for efficiency reasons. Now that that's fixed, we can remove the
SvTEMP() and SvAMAGIC() tests.
As to why pp_subst did the SvTEMP test, I don't know: but removing it
didn't make any tests fail!
M pp_hot.c
commit ed17c6e0c4437f531375c687e04038aac0a3fde5
Author: David Mitchell <[email protected]>
Date: Sat Sep 1 11:23:58 2012 +0100
tidy up pattern match copying code
(no functional changes).
1. Remove some dead code from pp_split; it's protected by an assert
that it could never be called.
2. Simplify the flags settings for the call to CALLREGEXEC() in
pp_substcont: on subsequent matches we always set REXEC_NOT_FIRST,
which forces the regex engine not to copy anyway, so passing the
REXEC_COPY_STR is pointless, as is the conditional code to set it.
3. (whitespace change): split a conditional expression over 2 lines
for easier reading.
M pp.c
M pp_ctl.c
M pp_hot.c
commit 17982ef0918a9bdd16fa1cba80777dfe7826eb89
Author: David Mitchell <[email protected]>
Date: Fri Aug 24 16:17:47 2012 +0100
stop $foo =~ /(bar)/g skipping copy
Normally in the presence of captures, a successful regex execution
makes a copy of the matched string, so that $1 et al give the right
value even if the original string is changed; i.e.
$foo =~ /(123)/g;
$foo = "bar";
is("$1", "123");
Until now that test would fail, because perl used to skip the copy for
the scalar /(...)/g case (but not the C<$&; //g> case). This was to
avoid a huge slowdown in code like the following:
$x = 'x' x 1_000_000;
1 while $x =~ /(.)/g;
which would otherwise end up copying a 1Mb string a million times.
Now that (with the last commit but one) we copy only the required
substring of the original string (a 1-byte substring in the above
example), we can remove this fast-but-incorrect hack.
M pp_hot.c
M t/re/pat_advanced.t
M t/re/pat_psycho.t
commit 90858c98ad7988657fadf49da91025067b38d4a9
Author: David Mitchell <[email protected]>
Date: Fri Aug 24 15:49:21 2012 +0100
rationalise t/re/pat_psycho.t
Do some cleanup of this file, without changing its functionality.
Once upon a time, the psycho tests were scattered throughout a single
pat.t file, before being moved into their own file. Now that they're all
in a single file, make the $PERL_SKIP_PSYCHO_TEST test a single "skip_all"
test at the beginning of the file, rather than testing it separately in
each code block.
Also, make some of the test descriptions more useful, and add a bit of
debugging output.
M t/re/pat_psycho.t
commit bbe94cee54bc43dc7062e050662b3897c85af61b
Author: David Mitchell <[email protected]>
Date: Thu Jul 26 16:04:09 2012 +0100
Don't copy all of the match string buffer
When a pattern matches, and that pattern contains captures (or $`, $&, $'
or /p are present), a copy is made of the whole original string, so
that $1 et al continue to hold the correct value even if the original
string is subsequently modified. This can have severe performance
penalties; for example, this code causes a 1Mb buffer to be allocated,
copied and freed a million times:
$&;
$x = 'x' x 1_000_000;
1 while $x =~ /(.)/g;
This commit changes this so that, where possible, only the needed
substring of the original string is copied: in the above case, only a
1-byte buffer is copied each time. Also, it now reuses or reallocs the
buffer, rather than freeing and mallocing each time.
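The reuse-or-realloc idea can be sketched roughly as follows. This is an
illustration only: struct match_buf and match_buf_store() are hypothetical
names, not the code in regexec.c, and the caller is assumed to pass a valid
half-open byte range [start, end) within src.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for the saved substring of a match target. */
struct match_buf {
    char  *data;  /* NUL-terminated copy, or NULL before first use */
    size_t cap;   /* bytes allocated for data */
    size_t len;   /* bytes copied, excluding the trailing NUL */
};

/* Copy src[start, end) into the buffer, growing it only when the existing
 * allocation is too small. Returns 0 on success, -1 if realloc fails. */
static int match_buf_store(struct match_buf *b, const char *src,
                           size_t start, size_t end)
{
    assert(start <= end);
    size_t need = end - start + 1;           /* +1 for the trailing NUL */

    if (need > b->cap) {
        char *p = realloc(b->data, need);    /* realloc(NULL, n) acts as malloc */
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap  = need;
    }
    memcpy(b->data, src + start, end - start);
    b->data[end - start] = '\0';
    b->len = end - start;
    return 0;
}
```

Called once per iteration of a //g loop, a helper like this touches the
allocator only when a later match needs more room than any earlier one.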
Now that PL_sawampersand is a 3-bit flag indicating separately whether
$`, $& and $' have been seen, they each contribute only their own
individual penalty; which ones have been seen will limit the extent to
which we can avoid copying the whole buffer.
Note that the above code *without* the $& is not currently slow, but only
because the copying is artificially disabled to avoid the performance hit.
The next but one commit will remove that hack, meaning that it will still
be fast, but will now be correct in the presence of a modified original
string.
We achieve this by adding suboffset and subcoffset fields to the
existing subbeg and sublen fields of a regex, to indicate how many bytes
and characters have been skipped from the logical start of the string till
the physical start of the buffer. To avoid copying stuff at the end, we
just reduce sublen. For example, in this:
"abcdefgh" =~ /(c)d/
subbeg points to a malloced buffer containing "c\0"; sublen == 1,
and suboffset == 2 (as does subcoffset).
while if $& has been seen,
subbeg points to a malloced buffer containing "cd\0"; sublen == 2,
and suboffset == 2.
If in addition $' has been seen, then
subbeg points to a malloced buffer containing "cdefgh\0"; sublen == 6,
and suboffset == 2.
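The three cases above can be reproduced with a small model of the range
calculation. This is a sketch under assumptions, not the engine's code: the
SEEN_* flag names, the structs and plan_copy() are all hypothetical, and
byte offsets are used throughout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for "has $`, $& or $' been seen". */
enum { SEEN_PRE = 1, SEEN_MATCH = 2, SEEN_POST = 4 };

struct range { size_t start, end; };          /* half-open: [start, end) */
struct copy_plan { size_t suboffset, sublen; };

/* Work out which part of the target string needs copying.
 * match is the range of $&; captures is the smallest range covering
 * $1, $2, ... (pass an empty range at match.start if there are none). */
static struct copy_plan plan_copy(size_t target_len, struct range match,
                                  struct range captures, unsigned seen)
{
    assert(match.start <= match.end && match.end <= target_len);
    assert(captures.start <= captures.end && captures.end <= target_len);

    size_t start = captures.start;
    size_t end   = captures.end;

    if (seen & SEEN_MATCH) {                  /* widen to cover $& */
        if (match.start < start) start = match.start;
        if (match.end   > end)   end   = match.end;
    }
    if (seen & SEEN_PRE) {                    /* $` runs from 0 to match.start */
        start = 0;
        if (match.start > end) end = match.start;
    }
    if (seen & SEEN_POST) {                   /* $' runs from match.end to the end */
        end = target_len;
        if (match.end < start) start = match.end;
    }

    struct copy_plan plan = { start, end - start };
    return plan;
}
```

For "abcdefgh" =~ /(c)d/ the match range is [2, 4) and the capture range is
[2, 3); the plan then gives sublen 1, 2 or 6 with suboffset 2, matching the
three cases listed above.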
The regex engine won't do this by default; there are two new flag bits,
REXEC_COPY_SKIP_PRE and REXEC_COPY_SKIP_POST, which in conjunction with
REXEC_COPY_STR, request that the engine skip the start or end of the
buffer (it will still copy in the presence of the relevant $`, $&, $',
/p).
Only pp_match has been enhanced to use these extra flags; substitution
can't easily benefit, since the usual action of s///g is to copy the
whole string first time round, then perform subsequent matching iterations
against the copy, without further copying. So you still need to copy most
of the buffer.
M dump.c
M ext/Devel-Peek/t/Peek.t
M mg.c
M pod/perlreapi.pod
M pp.c
M pp_ctl.c
M pp_hot.c
M regcomp.c
M regexec.c
M regexp.h
M t/porting/known_pod_issues.dat
M t/re/re_tests
commit a8e569b8c2d47e53f6a3260ff9185067ec5fcc9e
Author: David Mitchell <[email protected]>
Date: Thu Jul 26 15:35:39 2012 +0100
Separate handling of ${^PREMATCH} from $` etc
Currently the handling of getting the value, length etc of ${^PREMATCH}
etc is identical to that of $` etc.
Handle them separately, by adding RX_BUFF_IDX_CARET_PREMATCH etc
constants to the existing RX_BUFF_IDX_PREMATCH set.
This allows us, when retrieving them, to always return undef if the current
match didn't use //p. Previously the result depended on stuff such
as whether the (non-//p) pattern included captures or not.
The documentation for ${^PREMATCH} etc states that it's only guaranteed to
return a defined value when the last pattern was //p.
As well as making things more consistent, this is a necessary
prerequisite for the following commit, which may not always copy the
whole string during a non-//p match.
M mg.c
M regcomp.c
M regexp.h
M t/re/reg_pmod.t
commit df07e6993146350d6dd4861c50645669548fc2ea
Author: David Mitchell <[email protected]>
Date: Fri Jun 22 16:26:08 2012 +0100
regexec_flags(): simplify length calculation
The code to calculate the length of the string to copy was
PL_regeol - startpos + (stringarg - strbeg);
This is a hangover from the original (perl 3) regexp implementation
that under //i, copied and folded the original buffer: so startpos might
not equal stringarg. These days it always is (except under a match failure
with (*COMMIT), and the code we're interested in is only executed on success).
So simplify to just PL_regeol - strbeg.
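A small sketch of the arithmetic, with the commit's variables passed as
plain parameters (regeol stands in for PL_regeol; the two function names
are hypothetical). When startpos == stringarg, the two terms involving it
cancel and both expressions give the same length.

```c
#include <assert.h>
#include <stddef.h>

/* The old expression: PL_regeol - startpos + (stringarg - strbeg). */
static ptrdiff_t copy_len_old(const char *strbeg, const char *stringarg,
                              const char *startpos, const char *regeol)
{
    return regeol - startpos + (stringarg - strbeg);
}

/* The simplified expression: PL_regeol - strbeg. */
static ptrdiff_t copy_len_new(const char *strbeg, const char *regeol)
{
    assert(strbeg <= regeol);
    return regeol - strbeg;
}
```

For an 8-byte buffer with stringarg and startpos both 3 bytes in, the old
form computes 5 + 3 and the new form computes 8 directly.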
M regexec.c
commit baf273fedf22ce9ef32eca5765e6f42ce53dea51
Author: David Mitchell <[email protected]>
Date: Fri Jun 22 12:36:03 2012 +0100
PL_sawampersand: use 3 bit flags rather than bool
Set a separate flag for each of $`, $& and $'.
It still works fine in boolean context.
This will allow us to have more refined control over what parts
of a match string to copy (we currently copy the whole string).
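A minimal sketch of the idea, using hypothetical flag and function names
rather than the ones in perl.h: one bit per variable, with the combined
value still usable as a plain true/false test.

```c
#include <assert.h>

/* Hypothetical names: one bit each for $`, $& and $'. */
enum {
    SAW_PREMATCH  = 1 << 0,   /* $` */
    SAW_MATCH     = 1 << 1,   /* $& */
    SAW_POSTMATCH = 1 << 2    /* $' */
};

/* Record that one or more of the three variables has been seen. */
static unsigned note_seen(unsigned saw, unsigned which)
{
    assert((which & ~7u) == 0);   /* only the three defined bits */
    return saw | which;
}

/* Boolean context: nonzero if any of the three has been seen. */
static int any_seen(unsigned saw)
{
    return saw != 0;
}

/* Finer-grained test for one particular variable. */
static int was_seen(unsigned saw, unsigned which)
{
    return (saw & which) != 0;
}
```

Existing code that only asks "was any of them seen?" keeps working, while
new code can test a single bit to decide which part of the string to copy.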
M gv.c
M intrpvar.h
M perl.c
M perl.h
commit 15be01387bf616ecc45e30e8731ef8546d71c3fb
Author: David Mitchell <[email protected]>
Date: Wed Jun 20 14:17:05 2012 +0100
document args to regexec_flags and API
Document in the API, and clarify in the source code, what the arguments
to Perl_regexec_flags are.
NB: this info is based on code inspection, not any real knowledge on my
part.
M pod/perlreapi.pod
M regexec.c
-----------------------------------------------------------------------
--
Perl5 Master Repository