In perl.git, the branch blead has been updated <http://perl5.git.perl.org/perl.git/commitdiff/1703c1fc996f9d5943ebada0759bc5212289ae8e?hp=2d2826733b14efb7509c9c0c28d27bca6f31d681>
- Log ----------------------------------------------------------------- commit 1703c1fc996f9d5943ebada0759bc5212289ae8e Merge: 2d28267 3de645a Author: David Mitchell <[email protected]> Date: Sat Sep 8 15:42:56 2012 +0100 [MERGE] only copy bits of regex match string When making a copy of the string being matched against (so that $1, $& et al continue to show the correct value even if the original string is subsequently modified), only copy that substring of the original string needed for the capture variables, rather than copying the whole string. This is a big win for code like $&; $_ = 'x' x 1_000_000; 1 while /(.)/; Also, when pessimizing if the code contains $`, $& or $', record the presence of each variable separately, so that the determination of the substring range is based on each variable separately. So performance-wise, $&; /x/ is now roughly equivalent to /(x)/ whereas previously it was like /^(.*)(x)(.*)$/ and $&; $'; /x/ is now roughly equivalent to /(x)(.*)$/ etc. Finally, this code (when not in the presence of $& etc) $_ = 'x' x 1_000_000; 1 while /(.)/; used to skip the buffer copy for performance reasons, but suffered from $1 etc changing if the original string changed. That's now been fixed too. commit 3de645a82921698b4886d748e3a5a5ed98752f42 Author: David Mitchell <[email protected]> Date: Fri Sep 7 13:32:11 2012 +0100 fix a bug in handling $+[0] and unicode The code to decide what substring of a pattern target to copy for the sake of $1, $& etc, would, in the absence of $&, only copy the minimum range needed to cover $1,$2,...., which might be a shorter range than what $& covers. This is fine most of the time, but, when calculating $+[0] on a unicode string, it needs a copy of the whole part of the string covered by $&, since it needs to convert the byte offest into a char offset. So to fix this, always copy as a minimum, the $& range. I suppose we could be more clever about this: detect the presence of @+ in the code, only do it for UTF8 etc; but this is simple and non-fragile. M regexec.c M t/re/re_tests commit 6f1854a1fe246f8633ccd4d455563cb4210ceb39 Author: David Mitchell <[email protected]> Date: Sat Sep 1 11:43:53 2012 +0100 m// and s///; don't copy TEMP/AMAGIC strings Currently pp_match and pp_subst make a copy of the match string if it's SvTEMP(), and in the case of pp_match, also if it's SvAMAGIC(). This is no longer necessary, as the code will always copy the string anyway if its actually needed after the match, i.e. if it detects the presence of $1, $& or //p etc. Until a few commits ago, this wasn't the case for pp_match: it would sometimes skip copying even in the presence of $1 et al for efficiency reasons. Now that that's fixed, we can remove the SvTEMP() and SvAMAGIC() tests. As to why pp_subst did the SvTEMP test, I don't know: but removing it didn't make any tests fail! M pp_hot.c commit fbfb1899dd79253696b441cc1c4968a1057c2574 Author: David Mitchell <[email protected]> Date: Sat Sep 1 11:23:58 2012 +0100 tidy up patten match copying code (no functional changes). 1. Remove some dead code from pp_split; it's protected by an assert that it could never be called. 2. Simplify the flags settings for the call to CALLREGEXEC() in pp_substcont: on subsequent matches we always set REXEC_NOT_FIRST, which forces the regex engine not to copy anyway, so passing the REXEC_COPY_STR is pointless, as is the conditional code to set it. 3. (whitespace change): split a conditional expression over 2 lines for easier reading. M pp.c M pp_ctl.c M pp_hot.c commit a41aa44d9dc4a3ba586d871754bd11137bdc37a2 Author: David Mitchell <[email protected]> Date: Fri Aug 24 16:17:47 2012 +0100 stop $foo =~ /(bar)/g skipping copy Normally in the presence of captures, a successful regex execution makes a copy of the matched string, so that $1 et al give the right value even if the original string is changed; i.e. $foo =~ /(123)/g; $foo = "bar"; is("$1", "123"); Until now that test would fail, because perl used to skip the copy for the scalar /(...)/g case (but not the C<$&; //g> case). This was to avoid a huge slowdown in code like the following: $x = 'x' x 1_000_000; 1 while $x =~ /(.)/g; which would otherwise end up copying a 1Mb string a million times. Now that (with the last commit but one) we copy only the required substring of the original string (a 1-byte substring in the above example), we can remove this fast-but-incorrect hack. M pp_hot.c M t/re/pat_advanced.t M t/re/pat_psycho.t commit 9414be0160a1f343d4ae75ec161fec610da39c84 Author: David Mitchell <[email protected]> Date: Fri Aug 24 15:49:21 2012 +0100 rationalise t/re/pat_psycho.t Do some cleanup of this file, without changing its functionality. Once upon a time, the psycho tests were scattered throughout a single pat.t file, before being moved into their own file. Now that they're all in a single file, make the $PERL_SKIP_PSYCHO_TEST test a single "skip_all" test at the beginning of the file, rather than testing it separately in each code block. Also, make some of the test descriptions more useful, and add a bit of debugging output. M t/re/pat_psycho.t commit 6502e08109cd003b2cdf39bc94ef35e52203240b Author: David Mitchell <[email protected]> Date: Thu Jul 26 16:04:09 2012 +0100 Don't copy all of the match string buffer When a pattern matches, and that pattern contains captures (or $`, $&, $' or /p are present), a copy is made of the whole original string, so that $1 et al continue to hold the correct value even if the original string is subsequently modified. This can have severe performance penalties; for example, this code causes a 1Mb buffer to be allocated, copied and freed a million times: $&; $x = 'x' x 1_000_000; 1 while $x =~ /(.)/g; This commit changes this so that, where possible, only the needed substring of the original string is copied: in the above case, only a 1-byte buffer is copied each time. Also, it now reuses or reallocs the buffer, rather than freeing and mallocing each time. Now that PL_sawampersand is a 3-bit flag indicating separately whether $`, $& and $' have been seen, they each contribute only their own individual penalty; which ones have been seen will limit the extent to which we can avoid copying the whole buffer. Note that the above code *without* the $& is not currently slow, but only because the copying is artificially disabled to avoid the performance hit. The next but one commit will remove that hack, meaning that it will still be fast, but will now be correct in the presence of a modified original string. We achieve this by by adding suboffset and subcoffset fields to the existing subbeg and sublen fields of a regex, to indicate how many bytes and characters have been skipped from the logical start of the string till the physical start of the buffer. To avoid copying stuff at the end, we just reduce sublen. For example, in this: "abcdefgh" =~ /(c)d/ subbeg points to a malloced buffer containing "c\0"; sublen == 1, and suboffset == 2 (as does subcoffset). while if $& has been seen, subbeg points to a malloced buffer containing "cd\0"; sublen == 2, and suboffset == 2. If in addition $' has been seen, then subbeg points to a malloced buffer containing "cdefgh\0"; sublen == 6, and suboffset == 2. The regex engine won't do this by default; there are two new flag bits, REXEC_COPY_SKIP_PRE and REXEC_COPY_SKIP_POST, which in conjunction with REXEC_COPY_STR, request that the engine skip the start or end of the buffer (it will still copy in the presence of the relevant $`, $&, $', /p). Only pp_match has been enhanced to use these extra flags; substitution can't easily benefit, since the usual action of s///g is to copy the whole string first time round, then perform subsequent matching iterations against the copy, without further copying. So you still need to copy most of the buffer. M dump.c M ext/Devel-Peek/t/Peek.t M mg.c M pod/perlreapi.pod M pp.c M pp_ctl.c M pp_hot.c M regcomp.c M regexec.c M regexp.h M t/porting/known_pod_issues.dat M t/re/re_tests commit 2c7b5d7698f52b86acffe19a7ec15e85c99337fe Author: David Mitchell <[email protected]> Date: Thu Jul 26 15:35:39 2012 +0100 Separate handling of ${^PREMATCH} from $` etc Currently the handling of getting the value, length etc of ${^PREMATCH} etc is identical to that of $` etc. Handle them separately, by adding RX_BUFF_IDX_CARET_PREMATCH etc constants to the existing RX_BUFF_IDX_PREMATCH set. This allows, when retrieving them, to always return undef if the current match didn't use //p. Previously the result depended on stuff such as whether the (non-//p) pattern included captures or not. The documentation for ${^PREMATCH} etc states that it's only guaranteed to return a defined value when the last pattern was //p. As well as making things more consistent, this is a necessary prerequisite for the following commit, which may not always copy the whole string during a non-//p match. M mg.c M regcomp.c M regexp.h M t/re/reg_pmod.t commit ac0ba89f3ee4e5469c43dc0a34b548a9aa415f98 Author: David Mitchell <[email protected]> Date: Fri Jun 22 16:26:08 2012 +0100 regexec_flags(): simplify length calculation The code to calculate the length of the string to copy was PL_regeol - startpos + (stringarg - strbeg); This is a hangover from the original (perl 3) regexp implementation that under //i, copied and folded the original buffer: so startpos might not equal stringarg. These days it always is (except under a match failure with (*COMMIT), and the code we're interested is only executed on success). So simplify to just PL_regeol - strbeg. M regexec.c commit d3b97530399d61590a1500b52bdba553d657bda5 Author: David Mitchell <[email protected]> Date: Fri Jun 22 12:36:03 2012 +0100 PL_sawampersand: use 3 bit flags rather than bool Set a separate flag for each of $`, $& and $'. It still works fine in boolean context. This will allow us to have more refined control over what parts of a match string to copy (we currently copy the whole string). M gv.c M intrpvar.h M perl.c M perl.h commit 8fd1a95029bf0ff87a3064dec7d6645f40359f2c Author: David Mitchell <[email protected]> Date: Wed Jun 20 14:17:05 2012 +0100 document args to regexec_flags and API Document in the API, and clarify in the source code, what the arguments to Perl_regexec_flags are. NB: this info is based on code inspection, not any real knowledge on my part. M pod/perlreapi.pod M regexec.c ----------------------------------------------------------------------- Summary of changes: dump.c | 4 + ext/Devel-Peek/t/Peek.t | 2 + gv.c | 31 +++++++-- intrpvar.h | 2 +- mg.c | 74 +++++++++++----------- perl.c | 7 +- perl.h | 6 ++ pod/perlreapi.pod | 66 ++++++++++++++++++- pp.c | 10 +-- pp_ctl.c | 17 +++--- pp_hot.c | 39 ++++++++---- regcomp.c | 83 +++++++++++++++++------- regexec.c | 137 ++++++++++++++++++++++++++++++++++----- regexp.h | 34 +++++++++- t/porting/known_pod_issues.dat | 2 +- t/re/pat_advanced.t | 1 - t/re/pat_psycho.t | 65 ++++++++++++++++--- t/re/re_tests | 6 ++ t/re/reg_pmod.t | 58 +++++++++++------- 19 files changed, 485 insertions(+), 159 deletions(-) diff --git a/dump.c b/dump.c index ada6ae9..46893d6 100644 --- a/dump.c +++ b/dump.c @@ -2056,6 +2056,10 @@ Perl_do_sv_dump(pTHX_ I32 level, PerlIO *file, SV *sv, I32 nest, I32 maxnest, bo (UV)(r->pre_prefix)); Perl_dump_indent(aTHX_ level, file, " SUBLEN = %"IVdf"\n", (IV)(r->sublen)); + Perl_dump_indent(aTHX_ level, file, " SUBOFFSET = %"IVdf"\n", + (IV)(r->suboffset)); + Perl_dump_indent(aTHX_ level, file, " SUBCOFFSET = %"IVdf"\n", + (IV)(r->subcoffset)); if (r->subbeg) Perl_dump_indent(aTHX_ level, file, " SUBBEG = 0x%"UVxf" %s\n", PTR2UV(r->subbeg), diff --git a/ext/Devel-Peek/t/Peek.t b/ext/Devel-Peek/t/Peek.t index 6913d59..164e2ff 100644 --- a/ext/Devel-Peek/t/Peek.t +++ b/ext/Devel-Peek/t/Peek.t @@ -350,6 +350,8 @@ do_test('reference to regexp', GOFS = 0 PRE_PREFIX = 4 SUBLEN = 0 + SUBOFFSET = 0 + SUBCOFFSET = 0 SUBBEG = 0x0 ENGINE = $ADDR MOTHER_RE = $ADDR diff --git a/gv.c b/gv.c index c6e474e..e29f2fd 100644 --- a/gv.c +++ b/gv.c @@ -1655,12 +1655,23 @@ Perl_gv_fetchpvn_flags(pTHX_ const char *nambeg, STRLEN full_len, I32 flags, require_tie_mod(gv, name, newSVpvs("Tie::Hash::NamedCapture"), "TIEHASH", 0); } if (sv_type==SVt_PV || sv_type==SVt_PVGV) { - if (*name == '[') - require_tie_mod(gv,name,newSVpvs("arybase"),"FETCH",0); - else if (*name == '&' || *name == '`' || *name == '\'') { - PL_sawampersand = TRUE; - (void)GvSVn(gv); - } + switch (*name) { + case '[': + require_tie_mod(gv,name,newSVpvs("arybase"),"FETCH",0); + break; + case '`': + PL_sawampersand |= SAWAMPERSAND_LEFT; + (void)GvSVn(gv); + break; + case '&': + PL_sawampersand |= SAWAMPERSAND_MIDDLE; + (void)GvSVn(gv); + break; + case '\'': + PL_sawampersand |= SAWAMPERSAND_RIGHT; + (void)GvSVn(gv); + break; + } } } else if (len == 3 && sv_type == SVt_PVAV @@ -1866,7 +1877,13 @@ Perl_gv_fetchpvn_flags(pTHX_ const char *nambeg, STRLEN full_len, I32 flags, sv_type == SVt_PVCV || sv_type == SVt_PVFM || sv_type == SVt_PVIO - )) { PL_sawampersand = TRUE; } + )) { PL_sawampersand |= + (*name == '`') + ? SAWAMPERSAND_LEFT + : (*name == '&') + ? SAWAMPERSAND_MIDDLE + : SAWAMPERSAND_RIGHT; + } goto magicalize; case ':': /* $: */ diff --git a/intrpvar.h b/intrpvar.h index f57fa7d..94b7425 100644 --- a/intrpvar.h +++ b/intrpvar.h @@ -292,7 +292,7 @@ The C variable which corresponds to Perl's $^W warning variable. */ PERLVAR(I, dowarn, U8) -PERLVAR(I, sawampersand, bool) /* must save all match strings */ +PERLVAR(I, sawampersand, U8) /* must save all match strings */ PERLVAR(I, unsafe, bool) PERLVAR(I, exit_flags, U8) /* was exit() unexpected, etc. */ diff --git a/mg.c b/mg.c index 1f6d062..26cabbe 100644 --- a/mg.c +++ b/mg.c @@ -637,6 +637,8 @@ Perl_magic_regdata_cnt(pTHX_ SV *sv, MAGIC *mg) return (U32)-1; } +/* @-, @+ */ + int Perl_magic_regdatum_get(pTHX_ SV *sv, MAGIC *mg) { @@ -665,7 +667,9 @@ Perl_magic_regdatum_get(pTHX_ SV *sv, MAGIC *mg) if (i > 0 && RX_MATCH_UTF8(rx)) { const char * const b = RX_SUBBEG(rx); if (b) - i = utf8_length((U8*)b, (U8*)(b+i)); + i = RX_SUBCOFFSET(rx) + + utf8_length((U8*)b, + (U8*)(b-RX_SUBOFFSET(rx)+i)); } sv_setiv(sv, i); @@ -675,6 +679,8 @@ Perl_magic_regdatum_get(pTHX_ SV *sv, MAGIC *mg) return 0; } +/* @-, @+ */ + int Perl_magic_regdatum_set(pTHX_ SV *sv, MAGIC *mg) { @@ -913,9 +919,12 @@ Perl_magic_get(pTHX_ SV *sv, MAGIC *mg) if (nextchar == '\0') { /* ^P */ sv_setiv(sv, (IV)PL_perldb); } else if (strEQ(remaining, "REMATCH")) { /* $^PREMATCH */ - goto do_prematch_fetch; + + paren = RX_BUFF_IDX_CARET_PREMATCH; + goto do_numbuf_fetch; } else if (strEQ(remaining, "OSTMATCH")) { /* $^POSTMATCH */ - goto do_postmatch_fetch; + paren = RX_BUFF_IDX_CARET_POSTMATCH; + goto do_numbuf_fetch; } break; case '\023': /* ^S */ @@ -978,55 +987,46 @@ Perl_magic_get(pTHX_ SV *sv, MAGIC *mg) break; case '\015': /* $^MATCH */ if (strEQ(remaining, "ATCH")) { + paren = RX_BUFF_IDX_CARET_FULLMATCH; + goto do_numbuf_fetch; + } + case '1': case '2': case '3': case '4': case '5': case '6': case '7': case '8': case '9': case '&': - if (PL_curpm && (rx = PM_GETRE(PL_curpm))) { - /* - * Pre-threads, this was paren = atoi(GvENAME((const GV *)mg->mg_obj)); - * XXX Does the new way break anything? - */ - paren = atoi(mg->mg_ptr); /* $& is in [0] */ - CALLREG_NUMBUF_FETCH(rx,paren,sv); - break; - } - sv_setsv(sv,&PL_sv_undef); - } + /* + * Pre-threads, this was paren = atoi(GvENAME((const GV *)mg->mg_obj)); + * XXX Does the new way break anything? + */ + paren = atoi(mg->mg_ptr); /* $& is in [0] */ + do_numbuf_fetch: + if (PL_curpm && (rx = PM_GETRE(PL_curpm))) { + CALLREG_NUMBUF_FETCH(rx,paren,sv); + break; + } + sv_setsv(sv,&PL_sv_undef); break; case '+': if (PL_curpm && (rx = PM_GETRE(PL_curpm))) { - if (RX_LASTPAREN(rx)) { - CALLREG_NUMBUF_FETCH(rx,RX_LASTPAREN(rx),sv); - break; - } + paren = RX_LASTPAREN(rx); + if (paren) + goto do_numbuf_fetch; } sv_setsv(sv,&PL_sv_undef); break; case '\016': /* ^N */ if (PL_curpm && (rx = PM_GETRE(PL_curpm))) { - if (RX_LASTCLOSEPAREN(rx)) { - CALLREG_NUMBUF_FETCH(rx,RX_LASTCLOSEPAREN(rx),sv); - break; - } - + paren = RX_LASTCLOSEPAREN(rx); + if (paren) + goto do_numbuf_fetch; } sv_setsv(sv,&PL_sv_undef); break; case '`': - do_prematch_fetch: - if (PL_curpm && (rx = PM_GETRE(PL_curpm))) { - CALLREG_NUMBUF_FETCH(rx,-2,sv); - break; - } - sv_setsv(sv,&PL_sv_undef); - break; + paren = RX_BUFF_IDX_PREMATCH; + goto do_numbuf_fetch; case '\'': - do_postmatch_fetch: - if (PL_curpm && (rx = PM_GETRE(PL_curpm))) { - CALLREG_NUMBUF_FETCH(rx,-1,sv); - break; - } - sv_setsv(sv,&PL_sv_undef); - break; + paren = RX_BUFF_IDX_POSTMATCH; + goto do_numbuf_fetch; case '.': if (GvIO(PL_last_in_gv)) { sv_setiv(sv, (IV)IoLINES(GvIOp(PL_last_in_gv))); diff --git a/perl.c b/perl.c index 8444218..7d65719 100644 --- a/perl.c +++ b/perl.c @@ -860,7 +860,7 @@ perl_destruct(pTHXx) PL_minus_F = FALSE; PL_doswitches = FALSE; PL_dowarn = G_WARN_OFF; - PL_sawampersand = FALSE; /* must save all match strings */ + PL_sawampersand = 0; /* must save all match strings */ PL_unsafe = FALSE; Safefree(PL_inplace); @@ -2343,8 +2343,9 @@ STATIC void S_run_body(pTHX_ I32 oldscope) { dVAR; - DEBUG_r(PerlIO_printf(Perl_debug_log, "%s $` $& $' support.\n", - PL_sawampersand ? "Enabling" : "Omitting")); + DEBUG_r(PerlIO_printf(Perl_debug_log, "%s $` $& $' support (0x%x).\n", + PL_sawampersand ? "Enabling" : "Omitting", + (unsigned int)(PL_sawampersand))); if (!PL_restartop) { #ifdef PERL_MAD diff --git a/perl.h b/perl.h index 2cc4e91..b299432 100644 --- a/perl.h +++ b/perl.h @@ -4854,6 +4854,12 @@ typedef enum { #define HINT_SORT_MERGESORT 0x00000002 #define HINT_SORT_STABLE 0x00000100 /* sort styles (currently one) */ +/* flags for PL_sawampersand */ + +#define SAWAMPERSAND_LEFT 1 /* saw $` */ +#define SAWAMPERSAND_MIDDLE 2 /* saw $& */ +#define SAWAMPERSAND_RIGHT 4 /* saw $' */ + /* Various states of the input record separator SV (rs) */ #define RsSNARF(sv) (! SvOK(sv)) #define RsSIMPLE(sv) (SvOK(sv) && (! SvPOK(sv) || SvCUR(sv))) diff --git a/pod/perlreapi.pod b/pod/perlreapi.pod index 35b6b74..1ccc6d8 100644 --- a/pod/perlreapi.pod +++ b/pod/perlreapi.pod @@ -209,7 +209,49 @@ faster than C<unpack>. I32 minend, SV* screamer, void* data, U32 flags); -Execute a regexp. +Execute a regexp. The arguments are + +=over 4 + +=item rx + +The regular expression to execute. + +=item screamer + +This strangely-named arg is the SV to be matched against. Note that the +actual char array to be matched against is supplied by the arguments +described below; the SV is just used to determine UTF8ness, C<pos()> etc. + +=item strbeg + +Pointer to the physical start of the string. + +=item strend + +Pointer to the character following the physical end of the string (i.e. +the \0). + +=item stringarg + +Pointer to the position in the string where matching should start; it might +not be equal to C<strbeg> (for example in a later iteration of C</.../g>). + +=item minend + +Minimum length of string (measured in bytes from C<stringarg>) that must +match; if the engine reaches the end of the match but hasn't reached this +position in the string, it should fail. + +=item data + +Optimisation data; subject to change. + +=item flags + +Optimisation flags; subject to change. + +=back =head2 intuit @@ -513,6 +555,8 @@ values. char *subbeg; /* saved or original string so \digit works forever. */ SV_SAVED_COPY /* If non-NULL, SV which is COW from original */ I32 sublen; /* Length of string pointed by subbeg */ + I32 suboffset; /* byte offset of subbeg from logical start of str */ + I32 subcoffset; /* suboffset equiv, but in chars (for @-/@+) */ /* Information about the match that isn't often used */ I32 prelen; /* length of precomp */ @@ -653,9 +697,23 @@ occur at a floating offset from the start of the pattern. Used to do Fast-Boyer-Moore searches on the string to find out if its worth using the regex engine at all, and if so where in the string to search. -=head2 C<subbeg> C<sublen> C<saved_copy> - -Used during execution phase for managing search and replace patterns. +=head2 C<subbeg> C<sublen> C<saved_copy> C<suboffset> C<subcoffset> + +Used during the execution phase for managing search and replace patterns, +and for providing the text for C<$&>, C<$1> etc. C<subbeg> points to a +buffer (either the original string, or a copy in the case of +C<RX_MATCH_COPIED(rx)>), and C<sublen> is the length of the buffer. The +C<RX_OFFS> start and end indices index into this buffer. + +In the presence of the C<REXEC_COPY_STR> flag, but with the addition of +the C<REXEC_COPY_SKIP_PRE> or C<REXEC_COPY_SKIP_POST> flags, an engine +can choose not to copy the full buffer (although it must still do so in +the presence of C<RXf_PMf_KEEPCOPY> or the relevant bits being set in +C<PL_sawampersand>). In this case, it may set C<suboffset> to indicate the +number of bytes from the logical start of the buffer to the physical start +(i.e. C<subbeg>). It should also set C<subcoffset>, the number of +characters in the offset. The latter is needed to support C<@-> and C<@+> +which work in characters, not bytes. =head2 C<wrapped> C<wraplen> diff --git a/pp.c b/pp.c index 29db8ed..e1a6c78 100644 --- a/pp.c +++ b/pp.c @@ -5549,13 +5549,9 @@ PP(pp_split) if (rex_return == 0) break; TAINT_IF(RX_MATCH_TAINTED(rx)); - if (RX_MATCH_COPIED(rx) && RX_SUBBEG(rx) != orig) { - m = s; - s = orig; - orig = RX_SUBBEG(rx); - s = orig + (m - s); - strend = s + (strend - m); - } + /* we never pass the REXEC_COPY_STR flag, so it should + * never get copied */ + assert(!RX_MATCH_COPIED(rx)); m = RX_OFFS(rx)[0].start + orig; if (gimme_scalar) { diff --git a/pp_ctl.c b/pp_ctl.c index 1477373..af0d558 100644 --- a/pp_ctl.c +++ b/pp_ctl.c @@ -216,9 +216,7 @@ PP(pp_substcont) if (CxONCE(cx) || s < orig || !CALLREGEXEC(rx, s, cx->sb_strend, orig, (s == m) + RX_GOFS(rx), cx->sb_targ, NULL, - ((cx->sb_rflags & REXEC_COPY_STR) - ? (REXEC_IGNOREPOS|REXEC_NOT_FIRST) - : (REXEC_COPY_STR|REXEC_IGNOREPOS|REXEC_NOT_FIRST)))) + (REXEC_IGNOREPOS|REXEC_NOT_FIRST))) { SV *targ = cx->sb_targ; @@ -289,6 +287,7 @@ PP(pp_substcont) if (RX_MATCH_COPIED(rx) && RX_SUBBEG(rx) != orig) { m = s; s = orig; + assert(!RX_SUBOFFSET(rx)); cx->sb_orig = orig = RX_SUBBEG(rx); s = orig + (m - s); cx->sb_strend = s + (cx->sb_strend - m); @@ -353,9 +352,9 @@ Perl_rxres_save(pTHX_ void **rsp, REGEXP *rx) if (!p || p[1] < RX_NPARENS(rx)) { #ifdef PERL_OLD_COPY_ON_WRITE - i = 7 + RX_NPARENS(rx) * 2; + i = 7 + (RX_NPARENS(rx)+1) * 2; #else - i = 6 + RX_NPARENS(rx) * 2; + i = 6 + (RX_NPARENS(rx)+1) * 2; #endif if (!p) Newx(p, i, UV); @@ -364,7 +363,7 @@ Perl_rxres_save(pTHX_ void **rsp, REGEXP *rx) *rsp = (void*)p; } - *p++ = PTR2UV(RX_MATCH_COPIED(rx) ? RX_SUBBEG(rx) : NULL); + *p++ = RX_MATCH_COPIED(rx) ? 1 : 0; RX_MATCH_COPIED_off(rx); #ifdef PERL_OLD_COPY_ON_WRITE @@ -373,9 +372,10 @@ Perl_rxres_save(pTHX_ void **rsp, REGEXP *rx) #endif *p++ = RX_NPARENS(rx); - *p++ = PTR2UV(RX_SUBBEG(rx)); *p++ = (UV)RX_SUBLEN(rx); + *p++ = (UV)RX_SUBOFFSET(rx); + *p++ = (UV)RX_SUBCOFFSET(rx); for (i = 0; i <= RX_NPARENS(rx); ++i) { *p++ = (UV)RX_OFFS(rx)[i].start; *p++ = (UV)RX_OFFS(rx)[i].end; @@ -403,9 +403,10 @@ S_rxres_restore(pTHX_ void **rsp, REGEXP *rx) #endif RX_NPARENS(rx) = *p++; - RX_SUBBEG(rx) = INT2PTR(char*,*p++); RX_SUBLEN(rx) = (I32)(*p++); + RX_SUBOFFSET(rx) = (I32)*p++; + RX_SUBCOFFSET(rx) = (I32)*p++; for (i = 0; i <= RX_NPARENS(rx); ++i) { RX_OFFS(rx)[i].start = (I32)(*p++); RX_OFFS(rx)[i].end = (I32)(*p++); diff --git a/pp_hot.c b/pp_hot.c index 6c3f4f6..0d70dfc 100644 --- a/pp_hot.c +++ b/pp_hot.c @@ -1319,15 +1319,18 @@ PP(pp_match) } } } - /* XXX: comment out !global get safe $1 vars after a - match, BUT be aware that this leads to dramatic slowdowns on - /g matches against large strings. So far a solution to this problem - appears to be quite tricky. - Test for the unsafe vars are TODO for now. */ - if ( (!global && RX_NPARENS(rx)) - || SvTEMP(TARG) || SvAMAGIC(TARG) || PL_sawampersand - || (RX_EXTFLAGS(rx) & (RXf_EVAL_SEEN|RXf_PMf_KEEPCOPY))) - r_flags |= REXEC_COPY_STR; + if ( RX_NPARENS(rx) + || PL_sawampersand + || (RX_EXTFLAGS(rx) & (RXf_EVAL_SEEN|RXf_PMf_KEEPCOPY)) + ) { + r_flags |= (REXEC_COPY_STR|REXEC_COPY_SKIP_PRE); + /* in @a =~ /(.)/g, we iterate multiple times, but copy the buffer + * only on the first iteration. Therefore we need to copy $' as well + * as $&, to make the rest of the string available for captures in + * subsequent iterations */ + if (! (global && gimme == G_ARRAY)) + r_flags |= REXEC_COPY_SKIP_POST; + }; play_it_again: if (global && RX_OFFS(rx)[0].start != -1) { @@ -1472,6 +1475,8 @@ yup: /* Confirmed by INTUIT */ if (global) { /* FIXME - should rx->subbeg be const char *? */ RX_SUBBEG(rx) = (char *) truebase; + RX_SUBOFFSET(rx) = 0; + RX_SUBCOFFSET(rx) = 0; RX_OFFS(rx)[0].start = s - truebase; if (RX_MATCH_UTF8(rx)) { char * const t = (char*)utf8_hop((U8*)s, RX_MINLENRET(rx)); @@ -1507,6 +1512,8 @@ yup: /* Confirmed by INTUIT */ #endif } RX_SUBLEN(rx) = strend - t; + RX_SUBOFFSET(rx) = 0; + RX_SUBCOFFSET(rx) = 0; RX_MATCH_COPIED_on(rx); off = RX_OFFS(rx)[0].start = s - t; RX_OFFS(rx)[0].end = off + RX_MINLENRET(rx); @@ -2127,9 +2134,13 @@ PP(pp_subst) pm = PL_curpm; rx = PM_GETRE(pm); } - r_flags = (RX_NPARENS(rx) || SvTEMP(TARG) || PL_sawampersand - || (RX_EXTFLAGS(rx) & (RXf_EVAL_SEEN|RXf_PMf_KEEPCOPY)) ) - ? REXEC_COPY_STR : 0; + + r_flags = ( RX_NPARENS(rx) + || PL_sawampersand + || (RX_EXTFLAGS(rx) & (RXf_EVAL_SEEN|RXf_PMf_KEEPCOPY)) + ) + ? REXEC_COPY_STR + : 0; orig = m = s; if (RX_EXTFLAGS(rx) & RXf_USE_INTUIT) { @@ -2203,7 +2214,8 @@ PP(pp_subst) #ifdef PERL_OLD_COPY_ON_WRITE && !is_cow #endif - && (I32)clen <= RX_MINLENRET(rx) && (once || !(r_flags & REXEC_COPY_STR)) + && (I32)clen <= RX_MINLENRET(rx) + && (once || !(r_flags & REXEC_COPY_STR)) && !(RX_EXTFLAGS(rx) & RXf_LOOKBEHIND_SEEN) && (!doutf8 || SvUTF8(TARG)) && !(rpm->op_pmflags & PMf_NONDESTRUCT)) @@ -2331,6 +2343,7 @@ PP(pp_subst) if (RX_MATCH_COPIED(rx) && RX_SUBBEG(rx) != orig) { m = s; s = orig; + assert(RX_SUBOFFSET(rx) == 0); orig = RX_SUBBEG(rx); s = orig + (m - s); strend = s + (strend - m); diff --git a/regcomp.c b/regcomp.c index 921c0e9..a9e92e1 100644 --- a/regcomp.c +++ b/regcomp.c @@ -6691,37 +6691,53 @@ Perl_reg_numbered_buff_fetch(pTHX_ REGEXP * const r, const I32 paren, char *s = NULL; I32 i = 0; I32 s1, t1; + I32 n = paren; PERL_ARGS_ASSERT_REG_NUMBERED_BUFF_FETCH; - if (!rx->subbeg) { - sv_setsv(sv,&PL_sv_undef); - return; - } - else - if (paren == RX_BUFF_IDX_PREMATCH && rx->offs[0].start != -1) { - /* $` */ + if ( ( n == RX_BUFF_IDX_CARET_PREMATCH + || n == RX_BUFF_IDX_CARET_FULLMATCH + || n == RX_BUFF_IDX_CARET_POSTMATCH + ) + && !(rx->extflags & RXf_PMf_KEEPCOPY) + ) + goto ret_undef; + + if (!rx->subbeg) + goto ret_undef; + + if (n == RX_BUFF_IDX_CARET_FULLMATCH) + /* no need to distinguish between them any more */ + n = RX_BUFF_IDX_FULLMATCH; + + if ((n == RX_BUFF_IDX_PREMATCH || n == RX_BUFF_IDX_CARET_PREMATCH) + && rx->offs[0].start != -1) + { + /* $`, ${^PREMATCH} */ i = rx->offs[0].start; s = rx->subbeg; } else - if (paren == RX_BUFF_IDX_POSTMATCH && rx->offs[0].end != -1) { - /* $' */ - s = rx->subbeg + rx->offs[0].end; - i = rx->sublen - rx->offs[0].end; + if ((n == RX_BUFF_IDX_POSTMATCH || n == RX_BUFF_IDX_CARET_POSTMATCH) + && rx->offs[0].end != -1) + { + /* $', ${^POSTMATCH} */ + s = rx->subbeg - rx->suboffset + rx->offs[0].end; + i = rx->sublen + rx->suboffset - rx->offs[0].end; } else - if ( 0 <= paren && paren <= (I32)rx->nparens && - (s1 = rx->offs[paren].start) != -1 && - (t1 = rx->offs[paren].end) != -1) + if ( 0 <= n && n <= (I32)rx->nparens && + (s1 = rx->offs[n].start) != -1 && + (t1 = rx->offs[n].end) != -1) { - /* $& $1 ... */ + /* $&, ${^MATCH}, $1 ... */ i = t1 - s1; - s = rx->subbeg + s1; + s = rx->subbeg + s1 - rx->suboffset; } else { - sv_setsv(sv,&PL_sv_undef); - return; + goto ret_undef; } + + assert(s >= rx->subbeg); assert(rx->sublen >= (s - rx->subbeg) + i ); if (i >= 0) { const int oldtainted = PL_tainted; @@ -6757,6 +6773,7 @@ Perl_reg_numbered_buff_fetch(pTHX_ REGEXP * const r, const I32 paren, SvTAINTED_off(sv); } } else { + ret_undef: sv_setsv(sv,&PL_sv_undef); return; } @@ -6783,13 +6800,18 @@ Perl_reg_numbered_buff_length(pTHX_ REGEXP * const r, const SV * const sv, struct regexp *const rx = (struct regexp *)SvANY(r); I32 i; I32 s1, t1; + I32 n = paren; PERL_ARGS_ASSERT_REG_NUMBERED_BUFF_LENGTH; /* Some of this code was originally in C<Perl_magic_len> in F<mg.c> */ - switch (paren) { - /* $` / ${^PREMATCH} */ - case RX_BUFF_IDX_PREMATCH: + switch (paren) { + case RX_BUFF_IDX_CARET_PREMATCH: /* ${^PREMATCH} */ + if (!(rx->extflags & RXf_PMf_KEEPCOPY)) + goto warn_undef; + /*FALLTHROUGH*/ + + case RX_BUFF_IDX_PREMATCH: /* $` */ if (rx->offs[0].start != -1) { i = rx->offs[0].start; if (i > 0) { @@ -6799,8 +6821,11 @@ Perl_reg_numbered_buff_length(pTHX_ REGEXP * const r, const SV * const sv, } } return 0; - /* $' / ${^POSTMATCH} */ - case RX_BUFF_IDX_POSTMATCH: + + case RX_BUFF_IDX_CARET_POSTMATCH: /* ${^POSTMATCH} */ + if (!(rx->extflags & RXf_PMf_KEEPCOPY)) + goto warn_undef; + case RX_BUFF_IDX_POSTMATCH: /* $' */ if (rx->offs[0].end != -1) { i = rx->sublen - rx->offs[0].end; if (i > 0) { @@ -6810,6 +6835,13 @@ Perl_reg_numbered_buff_length(pTHX_ REGEXP * const r, const SV * const sv, } } return 0; + + case RX_BUFF_IDX_CARET_FULLMATCH: /* ${^MATCH} */ + if (!(rx->extflags & RXf_PMf_KEEPCOPY)) + goto warn_undef; + n = RX_BUFF_IDX_FULLMATCH; + /*FALLTHROUGH*/ + /* $& / ${^MATCH}, $1, $2, ... */ default: if (paren <= (I32)rx->nparens && @@ -6819,6 +6851,7 @@ Perl_reg_numbered_buff_length(pTHX_ REGEXP * const r, const SV * const sv, i = t1 - s1; goto getlen; } else { + warn_undef: if (ckWARN(WARN_UNINITIALIZED)) report_uninit((const SV *)sv); return 0; @@ -6826,7 +6859,7 @@ Perl_reg_numbered_buff_length(pTHX_ REGEXP * const r, const SV * const sv, } getlen: if (i > 0 && RXp_MATCH_UTF8(rx)) { - const char * const s = rx->subbeg + s1; + const char * const s = rx->subbeg - rx->suboffset + s1; const U8 *ep; STRLEN el; @@ -14429,6 +14462,8 @@ Perl_save_re_context(pTHX) PL_reg_oldsaved = NULL; PL_reg_oldsavedlen = 0; + PL_reg_oldsavedoffset = 0; + PL_reg_oldsavedcoffset = 0; PL_reg_maxiter = 0; PL_reg_leftiter = 0; PL_reg_poscache = NULL; diff --git a/regexec.c b/regexec.c index 4c9a456..2dc2314 100644 --- a/regexec.c +++ b/regexec.c @@ -502,10 +502,13 @@ S_regcp_restore(pTHX_ regexp *rex, I32 ix) I32 Perl_pregexec(pTHX_ REGEXP * const prog, char* stringarg, register char *strend, char *strbeg, I32 minend, SV *screamer, U32 nosave) -/* strend: pointer to null at end of string */ -/* strbeg: real beginning of string */ -/* minend: end of match must be >=minend after stringarg. */ -/* nosave: For optimizations. */ +/* stringarg: the point in the string at which to begin matching */ +/* strend: pointer to null at end of string */ +/* strbeg: real beginning of string */ +/* minend: end of match must be >= minend bytes after stringarg. */ +/* screamer: SV being matched: only used for utf8 flag, pos() etc; string + * itself is accessed via the pointers above */ +/* nosave: For optimizations. */ { PERL_ARGS_ASSERT_PREGEXEC; @@ -2051,13 +2054,17 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s, I32 Perl_regexec_flags(pTHX_ REGEXP * const rx, char *stringarg, register char *strend, char *strbeg, I32 minend, SV *sv, void *data, U32 flags) -/* strend: pointer to null at end of string */ -/* strbeg: real beginning of string */ -/* minend: end of match must be >=minend after stringarg. */ -/* data: May be used for some additional optimizations. - Currently its only used, with a U32 cast, for transmitting - the ganch offset when doing a /g match. This will change */ -/* nosave: For optimizations. */ +/* stringarg: the point in the string at which to begin matching */ +/* strend: pointer to null at end of string */ +/* strbeg: real beginning of string */ +/* minend: end of match must be >= minend bytes after stringarg. */ +/* sv: SV being matched: only used for utf8 flag, pos() etc; string + * itself is accessed via the pointers above */ +/* data: May be used for some additional optimizations. + Currently its only used, with a U32 cast, for transmitting + the ganch offset when doing a /g match. This will change */ +/* nosave: For optimizations. */ + { dVAR; struct regexp *const prog = (struct regexp *)SvANY(rx); @@ -2559,9 +2566,7 @@ got_it: /* make sure $`, $&, $', and $digit will work later */ if ( !(flags & REXEC_NOT_FIRST) ) { - RX_MATCH_COPY_FREE(rx); if (flags & REXEC_COPY_STR) { - const I32 i = PL_regeol - startpos + (stringarg - strbeg); #ifdef PERL_OLD_COPY_ON_WRITE if ((SvIsCOW(sv) || (SvFLAGS(sv) & CAN_COW_MASK) == CAN_COW_FLAGS)) { @@ -2573,17 +2578,105 @@ got_it: prog->saved_copy = sv_setsv_cow(prog->saved_copy, sv); prog->subbeg = (char *)SvPVX_const(prog->saved_copy); assert (SvPOKp(prog->saved_copy)); + prog->sublen = PL_regeol - strbeg; + prog->suboffset = 0; + prog->subcoffset = 0; } else #endif { - RX_MATCH_COPIED_on(rx); - s = savepvn(strbeg, i); - prog->subbeg = s; - } - prog->sublen = i; + I32 min = 0; + I32 max = PL_regeol - strbeg; + I32 sublen; + + if ( (flags & REXEC_COPY_SKIP_POST) + && !(RX_EXTFLAGS(rx) & RXf_PMf_KEEPCOPY) /* //p */ + && !(PL_sawampersand & SAWAMPERSAND_RIGHT) + ) { /* don't copy $' part of string */ + U32 n = 0; + max = -1; + /* calculate the right-most part of the string covered + * by a capture. Due to look-ahead, this may be to + * the right of $&, so we have to scan all captures */ + while (n <= prog->lastparen) { + if (prog->offs[n].end > max) + max = prog->offs[n].end; + n++; + } + if (max == -1) + max = (PL_sawampersand & SAWAMPERSAND_LEFT) + ? prog->offs[0].start + : 0; + assert(max >= 0 && max <= PL_regeol - strbeg); + } + + if ( (flags & REXEC_COPY_SKIP_PRE) + && !(RX_EXTFLAGS(rx) & RXf_PMf_KEEPCOPY) /* //p */ + && !(PL_sawampersand & SAWAMPERSAND_LEFT) + ) { /* don't copy $` part of string */ + U32 n = 0; + min = max; + /* calculate the left-most part of the string covered + * by a capture. Due to look-behind, this may be to + * the left of $&, so we have to scan all captures */ + while (min && n <= prog->lastparen) { + if ( prog->offs[n].start != -1 + && prog->offs[n].start < min) + { + min = prog->offs[n].start; + } + n++; + } + if ((PL_sawampersand & SAWAMPERSAND_RIGHT) + && min > prog->offs[0].end + ) + min = prog->offs[0].end; + + } + + assert(min >= 0 && min <= max && min <= PL_regeol - strbeg); + sublen = max - min; + + if (RX_MATCH_COPIED(rx)) { + if (sublen > prog->sublen) + prog->subbeg = + (char*)saferealloc(prog->subbeg, sublen+1); + } + else + prog->subbeg = (char*)safemalloc(sublen+1); + Copy(strbeg + min, prog->subbeg, sublen, char); + prog->subbeg[sublen] = '\0'; + prog->suboffset = min; + prog->sublen = sublen; + } + RX_MATCH_COPIED_on(rx); + prog->subcoffset = prog->suboffset; + if (prog->suboffset && utf8_target) { + /* Convert byte offset to chars. + * XXX ideally should only compute this if @-/@+ + * has been seen, a la PL_sawampersand ??? */ + + /* If there's a direct correspondence between the + * string which we're matching and the original SV, + * then we can use the utf8 len cache associated with + * the SV. In particular, it means that under //g, + * sv_pos_b2u() will use the previously cached + * position to speed up working out the new length of + * subcoffset, rather than counting from the start of + * the string each time. This stops + * $x = "\x{100}" x 1E6; 1 while $x =~ /(.)/g; + * from going quadratic */ + if (SvPOKp(sv) && SvPVX(sv) == strbeg) + sv_pos_b2u(sv, &(prog->subcoffset)); + else + prog->subcoffset = utf8_length((U8*)strbeg, + (U8*)(strbeg+prog->suboffset)); + } } else { + RX_MATCH_COPY_FREE(rx); prog->subbeg = strbeg; + prog->suboffset = 0; + prog->subcoffset = 0; prog->sublen = PL_regeol - strbeg; /* strend may have been modified */ } } @@ -2688,6 +2781,8 @@ S_regtry(pTHX_ regmatch_info *reginfo, char **startpos) $` inside (?{}) could fail... */ PL_reg_oldsaved = prog->subbeg; PL_reg_oldsavedlen = prog->sublen; + PL_reg_oldsavedoffset = prog->suboffset; + PL_reg_oldsavedcoffset = prog->suboffset; #ifdef PERL_OLD_COPY_ON_WRITE PL_nrs = prog->saved_copy; #endif @@ -2696,6 +2791,8 @@ S_regtry(pTHX_ regmatch_info *reginfo, char **startpos) else PL_reg_oldsaved = NULL; prog->subbeg = PL_bostr; + prog->suboffset = 0; + prog->subcoffset = 0; prog->sublen = PL_regeol - PL_bostr; /* strend may have been modified */ } #ifdef DEBUGGING @@ -4528,6 +4625,8 @@ S_regmatch(pTHX_ regmatch_info *reginfo, regnode *prog) RXp_MATCH_COPIED_off(re); re->subbeg = rex->subbeg; re->sublen = rex->sublen; + re->suboffset = rex->suboffset; + re->subcoffset = rex->subcoffset; rei = RXi_GET(re); DEBUG_EXECUTE_r( debug_start_match(re_sv, utf8_target, locinput, PL_regeol, @@ -7160,6 +7259,8 @@ restore_pos(pTHX_ void *arg) if (PL_reg_oldsaved) { rex->subbeg = PL_reg_oldsaved; rex->sublen = PL_reg_oldsavedlen; + rex->suboffset = PL_reg_oldsavedoffset; + rex->subcoffset = PL_reg_oldsavedcoffset; #ifdef PERL_OLD_COPY_ON_WRITE rex->saved_copy = PL_nrs; #endif diff --git a/regexp.h b/regexp.h index db36edd..3e245d0 100644 --- a/regexp.h +++ b/regexp.h @@ -124,6 +124,8 @@ struct reg_code_block { char *subbeg; \ SV_SAVED_COPY /* If non-NULL, SV which is COW from original */\ I32 sublen; /* Length of string pointed by subbeg */ \ + I32 suboffset; /* byte offset of subbeg from logical start of str */ \ + I32 subcoffset; /* suboffset equiv, but in chars (for @-/@+) */ \ /* Information about the match that isn't often used */ \ /* offset from wrapped to the start of precomp */ \ PERL_BITFIELD32 pre_prefix:4; \ @@ -181,9 +183,12 @@ typedef struct regexp_engine { paren name. >= 1 is reserved for actual numbered captures, i.e. $1, $2 etc. */ -#define RX_BUFF_IDX_PREMATCH -2 /* $` / ${^PREMATCH} */ -#define RX_BUFF_IDX_POSTMATCH -1 /* $' / ${^POSTMATCH} */ -#define RX_BUFF_IDX_FULLMATCH 0 /* $& / ${^MATCH} */ +#define RX_BUFF_IDX_CARET_PREMATCH -5 /* ${^PREMATCH} */ +#define RX_BUFF_IDX_CARET_POSTMATCH -4 /* ${^POSTMATCH} */ +#define RX_BUFF_IDX_CARET_FULLMATCH -3 /* ${^MATCH} */ +#define RX_BUFF_IDX_PREMATCH -2 /* $` */ +#define RX_BUFF_IDX_POSTMATCH -1 /* $' */ +#define RX_BUFF_IDX_FULLMATCH 0 /* $& */ /* Flags that are passed to the named_buff and named_buff_iter @@ -474,6 +479,18 @@ get_regex_charset_name(const U32 flags, STRLEN* const lenp) assert(SvTYPE(_rx_subbeg) == SVt_REGEXP); \ &SvANY(_rx_subbeg)->subbeg; \ })) +# define RX_SUBOFFSET(prog) \ + (*({ \ + const REGEXP *const _rx_suboffset = (prog); \ + assert(SvTYPE(_rx_suboffset) == SVt_REGEXP); \ + &SvANY(_rx_suboffset)->suboffset; \ + })) +# define RX_SUBCOFFSET(prog) \ + (*({ \ + const REGEXP *const _rx_subcoffset = (prog); \ + assert(SvTYPE(_rx_subcoffset) == SVt_REGEXP); \ + &SvANY(_rx_subcoffset)->subcoffset; \ + })) # define RX_OFFS(prog) \ (*({ \ const REGEXP *const _rx_offs = (prog); \ @@ -490,6 +507,8 @@ get_regex_charset_name(const U32 flags, STRLEN* const lenp) # define RX_EXTFLAGS(prog) RXp_EXTFLAGS((struct regexp *)SvANY(prog)) # define RX_ENGINE(prog) (((struct regexp *)SvANY(prog))->engine) # define RX_SUBBEG(prog) (((struct regexp *)SvANY(prog))->subbeg) +# define RX_SUBOFFSET(prog) (((struct regexp *)SvANY(prog))->suboffset) +# define RX_SUBCOFFSET(prog) (((struct regexp *)SvANY(prog))->subcoffset) # define RX_OFFS(prog) (((struct regexp *)SvANY(prog))->offs) # define RX_NPARENS(prog) (((struct regexp *)SvANY(prog))->nparens) #endif @@ -538,6 +557,11 @@ get_regex_charset_name(const U32 flags, STRLEN* const lenp) #define REXEC_SCREAM 0x04 /* use scream table. */ #define REXEC_IGNOREPOS 0x08 /* \G matches at start. */ #define REXEC_NOT_FIRST 0x10 /* This is another iteration of //g. */ + /* under REXEC_COPY_STR, it's ok for the + * engine (modulo PL_sawamperand etc) + * to skip copying ... */ +#define REXEC_COPY_SKIP_PRE 0x20 /* ...the $` part of the string, or */ +#define REXEC_COPY_SKIP_POST 0x40 /* ...the $' part of the string */ #if defined(__GNUC__) && !defined(PERL_GCC_BRACE_GROUPS_FORBIDDEN) # define ReREFCNT_inc(re) \ @@ -760,6 +784,8 @@ typedef struct regmatch_slab { #define PL_reg_curpm PL_reg_state.re_state_reg_curpm #define PL_reg_oldsaved PL_reg_state.re_state_reg_oldsaved #define PL_reg_oldsavedlen PL_reg_state.re_state_reg_oldsavedlen +#define PL_reg_oldsavedoffset PL_reg_state.re_state_reg_oldsavedoffset +#define PL_reg_oldsavedcoffset PL_reg_state.re_state_reg_oldsavedcoffset #define PL_reg_maxiter PL_reg_state.re_state_reg_maxiter #define PL_reg_leftiter PL_reg_state.re_state_reg_leftiter #define PL_reg_poscache PL_reg_state.re_state_reg_poscache @@ -781,6 +807,8 @@ struct re_save_state { PMOP *re_state_reg_curpm; /* from regexec.c */ char *re_state_reg_oldsaved; /* old saved substr during match */ STRLEN re_state_reg_oldsavedlen; /* old length of saved substr during match */ + STRLEN re_state_reg_oldsavedoffset; /* old offset of saved substr during match */ + STRLEN re_state_reg_oldsavedcoffset;/* old coffset of saved substr during match */ STRLEN re_state_reg_poscache_size; /* size of pos cache of WHILEM */ I32 re_state_reg_oldpos; /* from regexec.c */ I32 re_state_reg_maxiter; /* max wait until caching pos */ diff --git a/t/porting/known_pod_issues.dat b/t/porting/known_pod_issues.dat index f316fa7..ba4ccf6 100644 --- a/t/porting/known_pod_issues.dat +++ b/t/porting/known_pod_issues.dat @@ -266,7 +266,7 @@ pod/perlpacktut.pod Verbatim line length including indents exceeds 79 by 6 pod/perlperf.pod Verbatim line length including indents exceeds 79 by 154 pod/perlpodspec.pod Verbatim line length including indents exceeds 79 by 9 pod/perlpodstyle.pod Verbatim line length including indents exceeds 79 by 1 -pod/perlreapi.pod Verbatim line length including indents exceeds 79 by 17 +pod/perlreapi.pod Verbatim line length including indents exceeds 79 by 18 pod/perlrebackslash.pod Verbatim line length including indents exceeds 79 by 1 pod/perlref.pod Verbatim line length including indents exceeds 79 by 1 pod/perlreguts.pod Verbatim line length including indents exceeds 79 by 17 diff --git a/t/re/pat_advanced.t b/t/re/pat_advanced.t index 6692e1c..05cc191 100644 --- a/t/re/pat_advanced.t +++ b/t/re/pat_advanced.t @@ -1660,7 +1660,6 @@ $x='123'; print ">$1<\n"; EOP - local $::TODO = 'RT #86042'; fresh_perl_is(<<'EOP', ">abc<\n", {}, 'no mention of $&'); my $x; ($x='abc')=~/(abc)/g; diff --git a/t/re/pat_psycho.t b/t/re/pat_psycho.t index c5073a5..0433760 100644 --- a/t/re/pat_psycho.t +++ b/t/re/pat_psycho.t @@ -3,6 +3,9 @@ # This is a home for regular expression tests that don't fit into # the format supported by re/regexp.t. If you want to add a test # that does fit that format, add it to re/re_tests, not here. +# +# this file includes test that my burn a lot of CPU or otherwise be heavy +# on resources. Set env var $PERL_SKIP_PSYCHO_TEST to skip this file use strict; use warnings; @@ -21,7 +24,8 @@ BEGIN { } -plan tests => 11; # Update this when adding/deleting tests. +skip_all('$PERL_SKIP_PSYCHO_TEST set') if $ENV{PERL_SKIP_PSYCHO_TEST}; +plan tests => 15; # Update this when adding/deleting tests. run_tests() unless caller; @@ -29,16 +33,17 @@ run_tests() unless caller; # Tests start here. # sub run_tests { + print "# Set PERL_SKIP_PSYCHO_TEST to skip these tests\n"; - SKIP: { - print "# Set PERL_SKIP_PSYCHO_TEST to skip this test\n"; - my @normal = qw [the are some normal words]; - skip "Skipped Psycho", 2 * @normal if $ENV {PERL_SKIP_PSYCHO_TEST}; + # stress test tries + + my @normal = qw [the are some normal words]; local $" = "|"; + note "setting up trie psycho vars ..."; my @psycho = (@normal, map chr $_, 255 .. 20000); my $psycho1 = "@psycho"; for (my $i = @psycho; -- $i;) { @@ -48,13 +53,12 @@ sub run_tests { my $psycho2 = "@psycho"; foreach my $word (@normal) { - ok $word =~ /($psycho1)/ && $1 eq $word, 'Psycho'; - ok $word =~ /($psycho2)/ && $1 eq $word, 'Psycho'; + ok $word =~ /($psycho1)/ && $1 eq $word, qq{"$word" =~ /\$psycho1/}; + ok $word =~ /($psycho2)/ && $1 eq $word, qq{"$word" =~ /\$psycho1/}; } } - SKIP: { # stress test CURLYX/WHILEM. # @@ -63,8 +67,6 @@ sub run_tests { # CURLYX and WHILEM blocks, except those related to LONGJMP, the # super-linear cache and warnings. It executes about 0.5M regexes - skip "No psycho tests" if $ENV {PERL_SKIP_PSYCHO_TEST}; - print "# Set PERL_SKIP_PSYCHO_TEST to skip this test\n"; my $r = qr/^ (?: ( (?:a|z+)+ ) @@ -158,6 +160,49 @@ sub run_tests { } ok($ok, $msg); } + + + { + # these bits of test code used to run quadratically. If we break + # anything, they'll start to take minutes to run, rather than + # seconds. We don't actually measure times or set alarms, since + # that tends to be very fragile and prone to false positives. + # Instead, just hope that if someone is messing with + # performance-related code, they'll re-run the test suite and + # notice it suddenly takes a lot longer. + + my $x; + + $x = 'x' x 1_000_000; + 1 while $x =~ /(.)/g; + pass "ascii =~ /(.)/"; + + { + local ${^UTF8CACHE} = 1; # defeat debugging + $x = "\x{100}" x 1_000_000; + 1 while $x =~ /(.)/g; + pass "utf8 =~ /(.)/"; + } + + # run these in separate processes, since they set $& + + fresh_perl_is(<<'EOF', "ok\n", {}, 'ascii =~ /(.)/, mention $&'); +$&; +$x = 'x' x 1_000_000; +1 while $x =~ /(.)/g; +print "ok\n"; +EOF + + fresh_perl_is(<<'EOF', "ok\n", {}, 'utf8 =~ /(.)/, mention $&'); +$&; +local ${^UTF8CACHE} = 1; # defeat debugging +$x = "\x{100}" x 1_000_000; +1 while $x =~ /(.)/g; +print "ok\n"; +EOF + + + } } # End of sub run_tests 1; diff --git a/t/re/re_tests b/t/re/re_tests index f44bdc1..1aebbe6 100644 --- a/t/re/re_tests +++ b/t/re/re_tests @@ -163,6 +163,7 @@ ab|cd abcd y $& ab ()ef def y $&-$1 ef- ()ef def y $-[0] 1 ()ef def y $+[0] 3 +()\x{100}\x{1000} d\x{100}\x{1000} y $+[0] 3 ()ef def y $-[1] 1 ()ef def y $+[1] 1 *a - c - Quantifier follows nothing @@ -1702,5 +1703,10 @@ ab[c\\\](??{"x"})]{3}d ab\\](d y - - \W \x{200D} n - - /^(?d:\xdf|_)*_/i \x{17f}\x{17f}_ y $& \x{17f}\x{17f}_ +# +# check that @-, @+ count chars, not bytes; especially if beginning of +# string is not copied + +(\x{100}) \x{2000}\x{2000}\x{2000}\x{100} y $-[0]:$-[1]:$+[0]:$+[1] 3:3:4:4 # vim: softtabstop=0 noexpandtab diff --git a/t/re/reg_pmod.t b/t/re/reg_pmod.t index 301aeef..3190e03 100644 --- a/t/re/reg_pmod.t +++ b/t/re/reg_pmod.t @@ -11,38 +11,52 @@ use warnings; our @tests = ( # /p Pattern PRE MATCH POST - [ '/p', "456", "123-", "456", "-789"], - [ '(?p)', "456", "123-", "456", "-789"], - [ '', "(456)", "123-", "456", "-789"], - [ '', "456", undef, undef, undef ], + [ '/p', "345", "12-", "345", "-6789"], + [ '(?p)', "345", "12-", "345", "-6789"], + [ '', "(345)", undef, undef, undef ], + [ '', "345", undef, undef, undef ], ); -plan tests => 4 * @tests + 2; +plan tests => 14 * @tests + 2; my $W = ""; $SIG{__WARN__} = sub { $W.=join("",@_); }; sub _u($$) { "$_[0] is ".(defined $_[1] ? "'$_[1]'" : "undef") } -$_ = '123-456-789'; foreach my $test (@tests) { my ($p, $pat,$l,$m,$r) = @$test; - my $test_name = $p eq '/p' ? "/$pat/p" - : $p eq '(?p)' ? "/(?p)$pat/" - : "/$pat/"; + for my $sub (0,1) { + my $test_name = $p eq '/p' ? "/$pat/p" + : $p eq '(?p)' ? "/(?p)$pat/" + : "/$pat/"; + $test_name = "s$test_name" if $sub; - # - # Cannot use if/else due to the scope invalidating ${^MATCH} and friends. - # - my $ok = ok $p eq '/p' ? /$pat/p - : $p eq '(?p)' ? /(?p)$pat/ - : /$pat/ - => $test_name; - SKIP: { - skip "/$pat/$p failed to match", 3 - unless $ok; - is(${^PREMATCH}, $l,_u "$test_name: ^PREMATCH",$l); - is(${^MATCH}, $m,_u "$test_name: ^MATCH",$m ); - is(${^POSTMATCH}, $r,_u "$test_name: ^POSTMATCH",$r ); + # + # Cannot use if/else due to the scope invalidating ${^MATCH} and friends. + # + $_ = '12-345-6789'; + my $ok = + $sub ? + ( $p eq '/p' ? s/$pat/abc/p + : $p eq '(?p)' ? s/(?p)$pat/abc/ + : s/$pat/abc/ + ) + : + ( $p eq '/p' ? /$pat/p + : $p eq '(?p)' ? /(?p)$pat/ + : /$pat/ + ); + ok $ok, $test_name; + SKIP: { + skip "/$pat/$p failed to match", 6 + unless $ok; + is(${^PREMATCH}, $l,_u "$test_name: ^PREMATCH",$l); + is(${^MATCH}, $m,_u "$test_name: ^MATCH",$m ); + is(${^POSTMATCH}, $r,_u "$test_name: ^POSTMATCH",$r ); + is(length ${^PREMATCH}, length $l, "$test_name: ^PREMATCH length"); + is(length ${^MATCH}, length $m, "$test_name: ^MATCH length"); + is(length ${^POSTMATCH},length $r, "$test_name: ^POSTMATCH length"); + } } } is($W,"","No warnings should be produced"); -- Perl5 Master Repository
