In perl.git, the branch blead has been updated <http://perl5.git.perl.org/perl.git/commitdiff/103f89a72f1697bbe803bc0808ed936e31469d5d?hp=48414424a7e7693e51c809e8bb1ea9e55e5436c8>
- Log ----------------------------------------------------------------- commit 103f89a72f1697bbe803bc0808ed936e31469d5d Merge: 4841442 9fc7410 Author: David Mitchell <[email protected]> Date: Wed Mar 19 19:28:47 2014 +0000 [MERGE] audit anchoring in re_intuit_start() This series of commits audits all the code in re_intuit_start() that handles anchoring, e.g. /^.../, /^.../m, /\G.../. It fixes a couple of potential performance issues, and makes the code a lot safer with with \G. (Until recently, intuit was skipped altogether under \G, so its \G handling isn't well developed and tested.) commit 9fc7410eed533bed6f3760ecea84b34a7ae4c33b Author: David Mitchell <[email protected]> Date: Wed Mar 19 18:05:24 2014 +0000 re_intuit_start(): move comments abut IMPLICIT After the previous reworking of the PREGf_IMPLICIT tests, move commentary about PREGf_IMPLICIT closer to where its now relevant. M regexec.c commit 6e88169dce11675b992a0c387588bd5a90a69a72 Author: David Mitchell <[email protected]> Date: Wed Mar 19 17:37:20 2014 +0000 re_intuit_start(): don't unset MBOL on uselessness When BmUSEFUL() for a pattern goes negative, we unset ~RXf_USE_INTUIT. A bug fix in 2010 (c941595168) made this part of the code also remove the RXf_ANCH_MBOL at the same time, if PREGf_IMPLICIT was set. However, this was really just working round a bug in regexec_flags(). Once intuit was disabled, regexec_flags() would then start taking the buggy code path. This buggy code path was separately fixed in 2012 by 21eede782, so there's no longer any need to remove this flag. So don't. M regexec.c commit 7d2d37f505b570402d76d76649cc2464812a5881 Author: David Mitchell <[email protected]> Date: Wed Mar 19 16:44:25 2014 +0000 re_intuit_start(): change definition of ml_anch The ml_anch variable is supposed to indicate that a multi-line match is possible, i.e. that the regex is anchored to a \n. Currently we just do ml_anch = (prog->intflags & PREGf_ANCH_MBOL); However, MBOL is also set on /.*..../; the two cases are distinguished by adding the PREGf_IMPLICIT flag too. So at the moment we have lots of tests along the lines of if (ml_anch && !(prog->intflags & PREGf_IMPLICIT)) Simplify this by adding the IMPLICIT condition when initially calculating ml_anch, so there's no need to keep testing for it later. This also means that ml_anch actually means what it says now. M regexec.c commit 343c8a296e416def92250d436a4ed6ffb932445f Author: David Mitchell <[email protected]> Date: Wed Mar 19 16:34:28 2014 +0000 re_intuit_start(): check for IMPLICIT in abs anch The block of code that deals with absolute anchors looks like: if (...) { if (.... && !PREGf_IMPLICIT) ...; if (FOO) { assert(!PREGf_IMPLICIT); .... } } where the constraints of FOO imply PREGf_IMPLICIT, as shown by the assertion. Simplify this to: if (... && !PREGf_IMPLICIT) { if (....) ...; if (FOO) { .... } } M regexec.c commit e0eb31e7431d11b499dac9bfbae6dda175eab0fd Author: David Mitchell <[email protected]> Date: Wed Mar 19 14:58:19 2014 +0000 re_intuit_start(): check for IMPLICIT in stclass In the stclass block of code, there are various checks for anchored-ness. Add !(prog->intflags & PREGf_IMPLICIT) to these conditions, since from the perspective of these tests, we can only quit early etc if these are real rather than fake anchors. As it happens, this currently makes no logical difference, since an PREGf_IMPLICIT pattern starts with .*, and so more or less by definition can't have an stclass. So this commit is really only for logical completeness, and to allow us to change the definition of ml_anch shortly. M regexec.c commit e0362b861b7abfc19b44160e65c0d94b98ae924b Author: David Mitchell <[email protected]> Date: Tue Mar 18 23:36:20 2014 +0000 re_intuit_start(): use better limit on anch float The 'check' block of code has special handling for if the pattern is anchored; in this case, it allows us to set an upper bound on where a floating substring might match; e.g. in /^..\d?abc/ the floating string can't be more than 3 chars from the absolute beginning of the string. Similarly with /..\G\d?abc/, it can't be more than 3 chars from the start position (assuming that position been calculated correctly by the caller). However, the current code used rx_origin as the base for the offset calculation, rather than strbeg/strpos as appropriate. This meant that a) the first time round the loop, if strpos > strbeg, then the upper bound would be set higher than needed; b) if we ever go back to restart: with an incremented rx_origin, then the upper limit is recalculated with more wasted slack at the latter end. This commit changes the limit calculation, which reduces the following from second to milliseconds: $s = "abcdefg" x 1_000_000; $s =~ /^XX\d{1,10}cde/ for 1..100; It also adds a quick test to skip hopping when the result is likely to leave end_point unchanged, and adds an explicit test for !PREGf_IMPLICIT. This latter test isn't strictly necessary, as if PREGf_IMPLICIT were set, it implies that the pattern starts with '.*', which implies that prog->check_offset_max == SSize_t_MAX, which is already tested for. However, it makes the overall condition more comprehensible, and makes it more robust in the face of future changes. M regexec.c M t/re/pat.t commit 7bb3b9eb56c4b9374abfae3780eeb77c607bb1a5 Author: David Mitchell <[email protected]> Date: Tue Mar 18 21:32:39 2014 +0000 re_intuit_start(): do 'not at start' check on BOL The quick reject test that says "if a pattern has an absolute anchor, then immediately reject if strpos != strbeg", currently skips that test if PREGf_ANCH_GPOS is present. Instead, skip unless BOL or SBOL is present. This means that something like /^abc\G/ will still do the quick reject test. I can't think of any example that will actually give a measurable performance boost, and this is a fairly unlikely scenario, but the code is more logical this way, and makes it more robust against future changes (it's less likely to suddenly skip the quick test where it used to do it). I've also updated the commentary on why we don't do a quick /\G/ test akin to the /^/ test, and added some more info about why we test for the PREGf_IMPLICIT flag. And I've added an assert about PREGf_IMPLICIT too. M regexec.c commit d0d4464849e2b30aee89c175ccb5465795de10ce Author: David Mitchell <[email protected]> Date: Tue Mar 18 20:15:27 2014 +0000 re_intuit_start(): reduce scope of /^...$/m test Intuit has a quick reject test for a fixed pattern that is anchored at both ends. For example, with the pattern /^abcd$/, only the exact strings "abcd" or "abcd\n" will match; anything else, and the match immediately fails. A fix for [perl #115242] correctly made intuit skip the test in the presence of //m, since in this case the $ doesn't necessarily correspond to the end of the string. However, the fix was too wide in scope; it caused //m patterns to skip searching for a known string anchored just at the start, as well as one anchored at both ends. With this commit, the following code now runs in a few milliseconds rather than a few seconds on my machine: $s = "abcdefg" x 1_000_000; $s =~ /(?-m:^)abcX?fg/m for 1..100; M regexec.c M t/re/pat.t commit fe4f3442a740e4a233ab9610229aca3f4cf6a21f Author: David Mitchell <[email protected]> Date: Tue Mar 18 16:10:59 2014 +0000 re_intuit_start(): change !ml_anch debugging msg When MBOL (/^.../m) matching is skipped, the debugging output looks like: Starting position does not contradict /^/m... which sounds a bit like the \n test *was* done and passed, rather than the test being skipped. Change the message to: (multiline anchor test skipped) M regexec.c commit 8f10278a677a11d641ea4247a9d28b058d11c78a Author: David Mitchell <[email protected]> Date: Tue Mar 18 15:26:00 2014 +0000 re_intuit_start(): don't set ml_anch on BOL re_intuit_start() decided that a pattern was capable of being anchored after *any* \n in the string for a //m pattern that contains a BOL (rather than an MBOL). This can happen by embedding one regex in another for example. This is an incorrect assumption, and means that intuit() might try against every \n position in the string rather than just trying at the beginning. With this commit, the following code on my machine reduces in execution time from 7000ms to 5ms: my $r = qr/^abcd/; my $s = "abcd-xyz\n" x 500_000; $s =~ /$r\d{1,2}xyz/m for 1..200; M regexec.c M t/re/pat.t ----------------------------------------------------------------------- Summary of changes: regexec.c | 92 +++++++++++++++++++++++++++++++++----------------------------- t/re/pat.t | 17 +++++++++++- 2 files changed, 65 insertions(+), 44 deletions(-) diff --git a/regexec.c b/regexec.c index f606622..f291be0 100644 --- a/regexec.c +++ b/regexec.c @@ -752,21 +752,31 @@ Perl_re_intuit_start(pTHX_ }); if (prog->intflags & PREGf_ANCH) { /* Match at \G, beg-of-str or after \n */ - /* Check after \n? */ - ml_anch = ( (prog->intflags & PREGf_ANCH_MBOL) - || ((prog->intflags & PREGf_ANCH_BOL) && multiline)); - if (!ml_anch) { + /* ml_anch: check after \n? + * + * A note about IMPLICIT: on an un-anchored pattern beginning + * with /.*.../, these flags will have been added by the + * compiler: + * /.*abc/, /.*abc/m: PREGf_IMPLICIT | PREGf_ANCH_MBOL + * /.*abc/s: PREGf_IMPLICIT | PREGf_ANCH_SBOL + */ + ml_anch = (prog->intflags & PREGf_ANCH_MBOL) + && !(prog->intflags & PREGf_IMPLICIT); + + if (!ml_anch && !(prog->intflags & PREGf_IMPLICIT)) { /* we are only allowed to match at BOS or \G */ /* trivially reject if there's a BOS anchor and we're not at BOS. - * In the case of \G, we hope(!) that the caller has already - * set strpos to pos()-gofs, and will already have checked - * that this anchor position is legal. So we can skip it here. + * + * Note that we don't try to do a similar quick reject for + * \G, since generally the caller will have calculated strpos + * based on pos() and gofs, so the string is already correctly + * anchored by definition; and handling the exceptions would + * be too fiddly (e.g. REXEC_IGNOREPOS). */ - if ( !(prog->intflags & PREGf_ANCH_GPOS) - && !(prog->intflags & PREGf_IMPLICIT) /* not a real BOL */ - && (strpos != strbeg)) + if ( strpos != strbeg + && (prog->intflags & (PREGf_ANCH_BOL|PREGf_ANCH_SBOL))) { DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log, " Not at start...\n")); @@ -784,22 +794,17 @@ Perl_re_intuit_start(pTHX_ * at position pos()-4+1, which lines up with the "a" */ if (prog->check_offset_min == prog->check_offset_max - && !(prog->intflags & PREGf_CANY_SEEN) - && ! multiline) /* /m can cause \n's to match that aren't - accounted for in the string max length. - See [perl #115242] */ + && !(prog->intflags & PREGf_CANY_SEEN)) { /* Substring at constant offset from beg-of-str... */ SSize_t slen = SvCUR(check); - char *s; - - s = HOP3c(strpos, prog->check_offset_min, strend); + char *s = HOP3c(strpos, prog->check_offset_min, strend); DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log, " Looking for check substr at fixed offset %"IVdf"...\n", (IV)prog->check_offset_min)); - if (SvTAIL(check)) { + if (SvTAIL(check) && !multiline) { /* In this case, the regex is anchored at the end too, * so the lengths must match exactly, give or take a \n. * NB: slen >= 1 since the last char of check is \n */ @@ -891,19 +896,32 @@ Perl_re_intuit_start(pTHX_ } - /* if the regex is absolutely anchored to the start of the string, - * then check_offset_max represents an upper bound on the string - * where the substr could start */ + /* If the regex is absolutely anchored to either the start of the + * string (BOL,SBOL) or to pos() (ANCH_GPOS), then + * check_offset_max represents an upper bound on the string where + * the substr could start. For the ANCH_GPOS case, we assume that + * the caller of intuit will have already set strpos to + * pos()-gofs, so in this case strpos + offset_max will still be + * an upper bound on the substr. + */ if (!ml_anch && prog->intflags & PREGf_ANCH - && prog->check_offset_max != SSize_t_MAX - && start_shift < prog->check_offset_max) + && prog->check_offset_max != SSize_t_MAX) { SSize_t len = SvCUR(check) - !!SvTAIL(check); - end_point = HOP3lim(start_point, - prog->check_offset_max - start_shift, - end_point -len) - + len; + const char * const anchor = + (prog->intflags & PREGf_ANCH_GPOS ? strpos : strbeg); + + /* do a bytes rather than chars comparison. It's conservative; + * so it skips doing the HOP if the result can't possibly end + * up earlier than the old value of end_point. + */ + if ((char*)end_point - anchor > prog->check_offset_max) { + end_point = HOP3lim((U8*)anchor, + prog->check_offset_max, + end_point -len) + + len; + } } DEBUG_OPTIMISE_MORE_r({ @@ -1127,10 +1145,7 @@ Perl_re_intuit_start(pTHX_ /* handle the extra constraint of /^.../m if present */ - if (ml_anch && rx_origin != strbeg && rx_origin[-1] != '\n' - /* May be due to an implicit anchor of m{.*foo} */ - && !(prog->intflags & PREGf_IMPLICIT)) - { + if (ml_anch && rx_origin != strbeg && rx_origin[-1] != '\n') { char *s; DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log, @@ -1202,8 +1217,7 @@ Perl_re_intuit_start(pTHX_ } else { DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log, - " Starting position does not contradict /%s^%s/m...\n", - PL_colors[0], PL_colors[1])); + " (multiline anchor test skipped)\n")); } success_at_start: @@ -1276,7 +1290,8 @@ Perl_re_intuit_start(pTHX_ } DEBUG_EXECUTE_r( PerlIO_printf(Perl_debug_log, " This position contradicts STCLASS...\n") ); - if ((prog->intflags & PREGf_ANCH) && !ml_anch) + if ((prog->intflags & PREGf_ANCH) && !ml_anch + && !(prog->intflags & PREGf_IMPLICIT)) goto fail; /* Contradict one of substrings */ @@ -1394,19 +1409,10 @@ Perl_re_intuit_start(pTHX_ prog->check_substr = prog->check_utf8 = NULL; /* disable */ prog->float_substr = prog->float_utf8 = NULL; /* clear */ check = NULL; /* abort */ - /* XXXX If the check string was an implicit check MBOL, then we need to unset the relevant flag - see http://bugs.activestate.com/show_bug.cgi?id=87173 */ - if (prog->intflags & PREGf_IMPLICIT) { - prog->intflags &= ~PREGf_ANCH_MBOL; - /* maybe we have no anchors left after this... */ - if (!(prog->intflags & PREGf_ANCH)) - prog->extflags &= ~RXf_IS_ANCHORED; - } /* XXXX This is a remnant of the old implementation. It looks wasteful, since now INTUIT can use many other heuristics. */ prog->extflags &= ~RXf_USE_INTUIT; - /* XXXX What other flags might need to be cleared in this branch? */ } } diff --git a/t/re/pat.t b/t/re/pat.t index 79c7e6a..04f8b84 100644 --- a/t/re/pat.t +++ b/t/re/pat.t @@ -20,7 +20,7 @@ BEGIN { require './test.pl'; } -plan tests => 717; # Update this when adding/deleting tests. +plan tests => 721; # Update this when adding/deleting tests. run_tests() unless caller; @@ -755,6 +755,8 @@ sub run_tests { ok($_ =~ /^abc\Gdef$/, $message); pos = 3; ok($_ =~ /c\Gd/, $message); + pos = 3; + ok($_ =~ /..\GX?def/, $message); } { @@ -1538,6 +1540,19 @@ EOP $s .= "abx"; ok($s =~ /^ab.*x/m, "distant float with /m"); + my $r = qr/^abcd/; + $s = "abcd-xyz\n" x 500_000; + $s =~ /$r\d{1,2}xyz/m for 1..200; + pass("BOL within //m mustn't run slowly"); + + $s = "abcdefg" x 1_000_000; + $s =~ /(?-m:^)abcX?fg/m for 1..100; + pass("BOL within //m mustn't skip absolute anchored check"); + + $s = "abcdefg" x 1_000_000; + $s =~ /^XX\d{1,10}cde/ for 1..100; + pass("abs anchored float string should fail quickly"); + } # These are based on looking at the code in regcomp.c -- Perl5 Master Repository
