In perl.git, the branch blead has been updated

<http://perl5.git.perl.org/perl.git/commitdiff/103f89a72f1697bbe803bc0808ed936e31469d5d?hp=48414424a7e7693e51c809e8bb1ea9e55e5436c8>

- Log -----------------------------------------------------------------
commit 103f89a72f1697bbe803bc0808ed936e31469d5d
Merge: 4841442 9fc7410
Author: David Mitchell <[email protected]>
Date:   Wed Mar 19 19:28:47 2014 +0000

    [MERGE] audit anchoring in re_intuit_start()
    
    This series of commits audits all the code in re_intuit_start()
    that handles anchoring, e.g. /^.../, /^.../m, /\G.../.
    
    It fixes a couple of potential performance issues, and makes the code a
    lot safer with with \G. (Until recently, intuit was skipped altogether
    under \G, so its \G handling isn't well developed and tested.)

commit 9fc7410eed533bed6f3760ecea84b34a7ae4c33b
Author: David Mitchell <[email protected]>
Date:   Wed Mar 19 18:05:24 2014 +0000

    re_intuit_start(): move comments abut IMPLICIT
    
    After the previous reworking of the PREGf_IMPLICIT tests, move commentary
    about PREGf_IMPLICIT closer to where its now relevant.

M       regexec.c

commit 6e88169dce11675b992a0c387588bd5a90a69a72
Author: David Mitchell <[email protected]>
Date:   Wed Mar 19 17:37:20 2014 +0000

    re_intuit_start(): don't unset MBOL on uselessness
    
    When BmUSEFUL() for a pattern goes negative, we unset ~RXf_USE_INTUIT.
    
    A bug fix in 2010 (c941595168) made this part of the code also remove
    the RXf_ANCH_MBOL at the same time, if PREGf_IMPLICIT was set.
    
    However, this was really just working round a bug in regexec_flags().
    Once intuit was disabled, regexec_flags() would then start taking the
    buggy code path.
    
    This buggy code path was separately fixed in 2012 by 21eede782, so there's
    no longer any need to remove this flag. So don't.

M       regexec.c

commit 7d2d37f505b570402d76d76649cc2464812a5881
Author: David Mitchell <[email protected]>
Date:   Wed Mar 19 16:44:25 2014 +0000

    re_intuit_start(): change definition of ml_anch
    
    The ml_anch variable is supposed to indicate that a multi-line match is
    possible, i.e. that the regex is anchored to a \n.
    
    Currently we just do
    
        ml_anch = (prog->intflags & PREGf_ANCH_MBOL);
    
    However, MBOL is also set on /.*..../; the two cases are distinguished by
    adding the PREGf_IMPLICIT flag too.
    
    So at the moment we have lots of tests along the lines of
    
        if (ml_anch && !(prog->intflags & PREGf_IMPLICIT))
    
    Simplify this by adding the IMPLICIT condition when initially calculating
    ml_anch, so there's no need to keep testing for it later. This also means
    that ml_anch actually means what it says now.

M       regexec.c

commit 343c8a296e416def92250d436a4ed6ffb932445f
Author: David Mitchell <[email protected]>
Date:   Wed Mar 19 16:34:28 2014 +0000

    re_intuit_start(): check for IMPLICIT in abs anch
    
    The block of code that deals with absolute anchors looks like:
    
        if (...) {
            if (.... && !PREGf_IMPLICIT) ...;
            if (FOO) {
                assert(!PREGf_IMPLICIT);
                ....
            }
        }
    
    where the constraints of FOO imply PREGf_IMPLICIT, as shown by the
    assertion. Simplify this to:
    
        if (... && !PREGf_IMPLICIT) {
            if (....) ...;
            if (FOO) {
                ....
            }
        }

M       regexec.c

commit e0eb31e7431d11b499dac9bfbae6dda175eab0fd
Author: David Mitchell <[email protected]>
Date:   Wed Mar 19 14:58:19 2014 +0000

    re_intuit_start(): check for IMPLICIT in stclass
    
    In the stclass block of code, there are various checks for anchored-ness.
    Add !(prog->intflags & PREGf_IMPLICIT) to these conditions, since
    from the perspective of these tests, we can only quit early etc if these
    are real rather than fake anchors.
    
    As it happens, this currently makes no logical difference, since an
    PREGf_IMPLICIT pattern starts with .*, and so more or less by definition
    can't have an stclass.
    
    So this commit is really only for logical completeness, and to allow us to
    change the definition of ml_anch shortly.

M       regexec.c

commit e0362b861b7abfc19b44160e65c0d94b98ae924b
Author: David Mitchell <[email protected]>
Date:   Tue Mar 18 23:36:20 2014 +0000

    re_intuit_start(): use better limit on anch float
    
    The 'check' block of code has special handling for if the pattern is
    anchored; in this case, it allows us to set an upper bound on where a
    floating substring might match; e.g. in
    
        /^..\d?abc/
    
    the floating string can't be more than 3 chars from the absolute beginning
    of the string. Similarly with /..\G\d?abc/, it can't be more than 3 chars
    from the start position (assuming that position been calculated correctly
    by the caller).
    
    However, the current code used rx_origin as the base for the offset
    calculation, rather than strbeg/strpos as appropriate. This meant that
    
    a) the first time round the loop, if strpos > strbeg, then the upper bound
    would be set higher than needed;
    b) if we ever go back to restart: with an incremented rx_origin, then the
    upper limit is recalculated with more wasted slack at the latter end.
    
    This commit changes the limit calculation, which reduces the following
    from second to milliseconds:
    
        $s = "abcdefg" x 1_000_000;
        $s =~ /^XX\d{1,10}cde/ for 1..100;
    
    It also adds a quick test to skip hopping when the result is likely to
    leave end_point unchanged, and adds an explicit test for !PREGf_IMPLICIT.
    This latter test isn't strictly necessary, as if PREGf_IMPLICIT were set,
    it implies that the pattern starts with '.*', which implies that
    prog->check_offset_max == SSize_t_MAX, which is already tested for.
    However, it makes the overall condition more comprehensible, and makes it
    more robust in the face of future changes.

M       regexec.c
M       t/re/pat.t

commit 7bb3b9eb56c4b9374abfae3780eeb77c607bb1a5
Author: David Mitchell <[email protected]>
Date:   Tue Mar 18 21:32:39 2014 +0000

    re_intuit_start(): do 'not at start' check on BOL
    
    The quick reject test that says "if a pattern has an absolute anchor,
    then immediately reject if strpos != strbeg", currently skips that test
    if PREGf_ANCH_GPOS is present. Instead, skip unless BOL or SBOL is present.
    
    This means that something like /^abc\G/ will still do the quick reject
    test.
    
    I can't think of any example that will actually give a measurable
    performance boost, and this is a fairly unlikely scenario, but the code is
    more logical this way, and makes it more robust against future changes
    (it's less likely to suddenly skip the quick test where it used to do it).
    
    I've also updated the commentary on why we don't do a quick /\G/ test
    akin to the /^/ test, and added some more info about why we test for the
    PREGf_IMPLICIT flag.
    
    And I've added an assert about PREGf_IMPLICIT too.

M       regexec.c

commit d0d4464849e2b30aee89c175ccb5465795de10ce
Author: David Mitchell <[email protected]>
Date:   Tue Mar 18 20:15:27 2014 +0000

    re_intuit_start(): reduce scope of /^...$/m test
    
    Intuit has a quick reject test for a fixed pattern that is anchored at
    both ends. For example, with the pattern /^abcd$/, only the exact strings
    "abcd" or "abcd\n" will match; anything else, and the match immediately
    fails.
    
    A fix for [perl #115242] correctly made intuit skip the test in the
    presence of //m, since in this case the $ doesn't necessarily correspond
    to the end of the string.
    
    However, the fix was too wide in scope; it caused //m patterns to skip
    searching for a known string anchored just at the start, as well as one
    anchored at both ends.
    
    With this commit, the following code now runs in a few milliseconds rather
    than a few seconds on my machine:
    
        $s = "abcdefg" x 1_000_000;
        $s =~ /(?-m:^)abcX?fg/m for 1..100;

M       regexec.c
M       t/re/pat.t

commit fe4f3442a740e4a233ab9610229aca3f4cf6a21f
Author: David Mitchell <[email protected]>
Date:   Tue Mar 18 16:10:59 2014 +0000

    re_intuit_start(): change !ml_anch debugging msg
    
    When MBOL (/^.../m) matching is skipped, the debugging output looks like:
    
        Starting position does not contradict /^/m...
    
    which sounds a bit like the \n test *was* done and passed, rather than
    the test being skipped. Change the message to:
    
        (multiline anchor test skipped)

M       regexec.c

commit 8f10278a677a11d641ea4247a9d28b058d11c78a
Author: David Mitchell <[email protected]>
Date:   Tue Mar 18 15:26:00 2014 +0000

    re_intuit_start(): don't set ml_anch on BOL
    
    re_intuit_start() decided that a pattern was capable of being anchored
    after *any* \n in the string for a //m pattern that contains a BOL
    (rather than an MBOL). This can happen by embedding one regex in another
    for example.
    
    This is an incorrect assumption, and means that intuit() might try
    against every \n position in the string rather than just trying at the
    beginning. With this commit, the following code on my machine reduces in
    execution time from 7000ms to 5ms:
    
        my $r = qr/^abcd/;
        my $s = "abcd-xyz\n" x 500_000;
        $s =~ /$r\d{1,2}xyz/m for 1..200;

M       regexec.c
M       t/re/pat.t
-----------------------------------------------------------------------

Summary of changes:
 regexec.c  | 92 +++++++++++++++++++++++++++++++++-----------------------------
 t/re/pat.t | 17 +++++++++++-
 2 files changed, 65 insertions(+), 44 deletions(-)

diff --git a/regexec.c b/regexec.c
index f606622..f291be0 100644
--- a/regexec.c
+++ b/regexec.c
@@ -752,21 +752,31 @@ Perl_re_intuit_start(pTHX_
     });
 
     if (prog->intflags & PREGf_ANCH) { /* Match at \G, beg-of-str or after \n 
*/
-        /* Check after \n? */
-       ml_anch = (     (prog->intflags & PREGf_ANCH_MBOL)
-                   || ((prog->intflags & PREGf_ANCH_BOL) && multiline));
 
-       if (!ml_anch) {
+        /* ml_anch: check after \n?
+         *
+         * A note about IMPLICIT: on an un-anchored pattern beginning
+         * with /.*.../, these flags will have been added by the
+         * compiler:
+         *   /.*abc/, /.*abc/m:  PREGf_IMPLICIT | PREGf_ANCH_MBOL
+         *   /.*abc/s:           PREGf_IMPLICIT | PREGf_ANCH_SBOL
+         */
+       ml_anch =      (prog->intflags & PREGf_ANCH_MBOL)
+                   && !(prog->intflags & PREGf_IMPLICIT);
+
+       if (!ml_anch && !(prog->intflags & PREGf_IMPLICIT)) {
             /* we are only allowed to match at BOS or \G */
 
             /* trivially reject if there's a BOS anchor and we're not at BOS.
-             * In the case of \G, we hope(!) that the caller has already
-             * set strpos to pos()-gofs, and will already have checked
-             * that this anchor position is legal. So we can skip it here.
+             *
+             * Note that we don't try to do a similar quick reject for
+             * \G, since generally the caller will have calculated strpos
+             * based on pos() and gofs, so the string is already correctly
+             * anchored by definition; and handling the exceptions would
+             * be too fiddly (e.g. REXEC_IGNOREPOS).
              */
-            if (   !(prog->intflags & PREGf_ANCH_GPOS)
-                && !(prog->intflags & PREGf_IMPLICIT) /* not a real BOL */
-               && (strpos != strbeg))
+            if (   strpos != strbeg
+                && (prog->intflags & (PREGf_ANCH_BOL|PREGf_ANCH_SBOL)))
             {
                DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log,
                                 "  Not at start...\n"));
@@ -784,22 +794,17 @@ Perl_re_intuit_start(pTHX_
              * at position pos()-4+1, which lines up with the "a" */
 
            if (prog->check_offset_min == prog->check_offset_max
-                && !(prog->intflags & PREGf_CANY_SEEN)
-                && ! multiline)   /* /m can cause \n's to match that aren't
-                                     accounted for in the string max length.
-                                     See [perl #115242] */
+                && !(prog->intflags & PREGf_CANY_SEEN))
             {
                /* Substring at constant offset from beg-of-str... */
                SSize_t slen = SvCUR(check);
-                char *s;
-
-               s = HOP3c(strpos, prog->check_offset_min, strend);
+                char *s = HOP3c(strpos, prog->check_offset_min, strend);
            
                 DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log,
                     "  Looking for check substr at fixed offset %"IVdf"...\n",
                     (IV)prog->check_offset_min));
 
-               if (SvTAIL(check)) {
+               if (SvTAIL(check) && !multiline) {
                     /* In this case, the regex is anchored at the end too,
                      * so the lengths must match exactly, give or take a \n.
                     * NB: slen >= 1 since the last char of check is \n */
@@ -891,19 +896,32 @@ Perl_re_intuit_start(pTHX_
        }
 
 
-        /* if the regex is absolutely anchored to the start of the string,
-         * then check_offset_max represents an upper bound on the string
-         * where the substr could start */
+        /* If the regex is absolutely anchored to either the start of the
+         * string (BOL,SBOL) or to pos() (ANCH_GPOS), then
+         * check_offset_max represents an upper bound on the string where
+         * the substr could start. For the ANCH_GPOS case, we assume that
+         * the caller of intuit will have already set strpos to
+         * pos()-gofs, so in this case strpos + offset_max will still be
+         * an upper bound on the substr.
+         */
         if (!ml_anch
             && prog->intflags & PREGf_ANCH
-            && prog->check_offset_max != SSize_t_MAX
-            && start_shift < prog->check_offset_max)
+            && prog->check_offset_max != SSize_t_MAX)
         {
             SSize_t len = SvCUR(check) - !!SvTAIL(check);
-            end_point = HOP3lim(start_point,
-                            prog->check_offset_max - start_shift,
-                            end_point -len)
-                        + len;
+            const char * const anchor =
+                        (prog->intflags & PREGf_ANCH_GPOS ? strpos : strbeg);
+
+            /* do a bytes rather than chars comparison. It's conservative;
+             * so it skips doing the HOP if the result can't possibly end
+             * up earlier than the old value of end_point.
+             */
+            if ((char*)end_point - anchor > prog->check_offset_max) {
+                end_point = HOP3lim((U8*)anchor,
+                                prog->check_offset_max,
+                                end_point -len)
+                            + len;
+            }
         }
 
        DEBUG_OPTIMISE_MORE_r({
@@ -1127,10 +1145,7 @@ Perl_re_intuit_start(pTHX_
 
     /* handle the extra constraint of /^.../m if present */
 
-    if (ml_anch && rx_origin != strbeg && rx_origin[-1] != '\n'
-        /* May be due to an implicit anchor of m{.*foo}  */
-        && !(prog->intflags & PREGf_IMPLICIT))
-    {
+    if (ml_anch && rx_origin != strbeg && rx_origin[-1] != '\n') {
         char *s;
 
         DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log,
@@ -1202,8 +1217,7 @@ Perl_re_intuit_start(pTHX_
     }
     else {
         DEBUG_EXECUTE_r(PerlIO_printf(Perl_debug_log,
-            "  Starting position does not contradict /%s^%s/m...\n",
-            PL_colors[0], PL_colors[1]));
+            "  (multiline anchor test skipped)\n"));
     }
 
   success_at_start:
@@ -1276,7 +1290,8 @@ Perl_re_intuit_start(pTHX_
            }
            DEBUG_EXECUTE_r( PerlIO_printf(Perl_debug_log,
                                "  This position contradicts STCLASS...\n") );
-            if ((prog->intflags & PREGf_ANCH) && !ml_anch)
+            if ((prog->intflags & PREGf_ANCH) && !ml_anch
+                        && !(prog->intflags & PREGf_IMPLICIT))
                goto fail;
 
            /* Contradict one of substrings */
@@ -1394,19 +1409,10 @@ Perl_re_intuit_start(pTHX_
            prog->check_substr = prog->check_utf8 = NULL;       /* disable */
            prog->float_substr = prog->float_utf8 = NULL;       /* clear */
            check = NULL;                       /* abort */
-           /* XXXX If the check string was an implicit check MBOL, then we 
need to unset the relevant flag
-                   see http://bugs.activestate.com/show_bug.cgi?id=87173 */
-            if (prog->intflags & PREGf_IMPLICIT) {
-                prog->intflags &= ~PREGf_ANCH_MBOL;
-                /* maybe we have no anchors left after this... */
-                if (!(prog->intflags & PREGf_ANCH))
-                    prog->extflags &= ~RXf_IS_ANCHORED;
-            }
            /* XXXX This is a remnant of the old implementation.  It
                    looks wasteful, since now INTUIT can use many
                    other heuristics. */
            prog->extflags &= ~RXf_USE_INTUIT;
-           /* XXXX What other flags might need to be cleared in this branch? */
        }
     }
 
diff --git a/t/re/pat.t b/t/re/pat.t
index 79c7e6a..04f8b84 100644
--- a/t/re/pat.t
+++ b/t/re/pat.t
@@ -20,7 +20,7 @@ BEGIN {
     require './test.pl';
 }
 
-plan tests => 717;  # Update this when adding/deleting tests.
+plan tests => 721;  # Update this when adding/deleting tests.
 
 run_tests() unless caller;
 
@@ -755,6 +755,8 @@ sub run_tests {
        ok($_ =~ /^abc\Gdef$/, $message);
        pos = 3;
        ok($_ =~ /c\Gd/, $message);
+       pos = 3;
+       ok($_ =~ /..\GX?def/, $message);
     }
 
     {
@@ -1538,6 +1540,19 @@ EOP
         $s .= "abx";
         ok($s =~ /^ab.*x/m, "distant float with /m");
 
+        my $r = qr/^abcd/;
+        $s = "abcd-xyz\n" x 500_000;
+        $s =~ /$r\d{1,2}xyz/m for 1..200;
+        pass("BOL within //m  mustn't run slowly");
+
+        $s = "abcdefg" x 1_000_000;
+        $s =~ /(?-m:^)abcX?fg/m for 1..100;
+        pass("BOL within //m  mustn't skip absolute anchored check");
+
+        $s = "abcdefg" x 1_000_000;
+        $s =~ /^XX\d{1,10}cde/ for 1..100;
+        pass("abs anchored float string should fail quickly");
+
     }
 
     # These are based on looking at the code in regcomp.c

--
Perl5 Master Repository

Reply via email to