[perl.git] branch sprout/regexp, created. v5.19.2-152-gf443632

Father Chrysostomos Fri, 26 Jul 2013 01:59:21 -0700

In perl.git, the branch sprout/regexp has been created

<http://perl5.git.perl.org/perl.git/commitdiff/f443632cbc812fcf86807b3215c34104d831c414?hp=0000000000000000000000000000000000000000>


        at  f443632cbc812fcf86807b3215c34104d831c414 (commit)

- Log -----------------------------------------------------------------
commit f443632cbc812fcf86807b3215c34104d831c414
Author: Father Chrysostomos <[email protected]>
Date:   Thu Jul 25 00:41:07 2013 -0700

    [perl #116907] Allow //g matching past 2**31 threshold
    
    Change the internal fields for storing positions so that //g in scalar
    context can move past the 2**31 character threshold.  Before this com-
    mit, the numbers would wrap, resulting in assertion failures.
    
    The changes in this commit are only enough to get the added test pass-
    ing.  Stay tuned for more.

M       embed.fnc
M       pp_hot.c
M       proto.h
M       regcomp.c
M       regexec.c
M       regexp.h
M       t/bigmem/regexp.t

commit a623047ae90a757a786f9f95f86127fc8d1864c0
Author: Father Chrysostomos <[email protected]>
Date:   Wed Jul 24 18:14:06 2013 -0700

    pp_hot.c: Show lengths in -Dr output for minlen optimisation

M       pp_hot.c

commit 7a98d1ae1be91d8290ed8e6d25283f45b25ada82
Author: Father Chrysostomos <[email protected]>
Date:   Wed Jul 24 14:23:54 2013 -0700

    Stop minlen regexp optimisation from rejecting long strings
    
    This fixes #112790 and part of #116907.
    
    The length of the string is cast to I32, so it wraps and ends up less
    than the minimum length.
    
    I don't know whether the type of minlen can change without breaking
    the regexp API, so simply skip this optimisation if minlen itself
    wraps and becomes negative (not really testable, since the regexp com-
    piler consumes far too much memory if the regexp itself is 2GB long).

M       MANIFEST
M       pp_hot.c
A       t/bigmem/regexp.t
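
Editorial sketch of the wrap described above, not the code from pp_hot.c. The function names are hypothetical, and the sketch only shows why a length narrowed to a signed 32-bit type can compare as smaller than minlen. It does not reproduce the actual fix, which skips the optimisation when minlen itself is negative.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the pre-fix check: the string length is
 * narrowed to a signed 32-bit value before being compared with minlen.
 * Converting an out-of-range value to int32_t is implementation-defined
 * in C; on common two's-complement platforms it wraps. */
static int rejects_with_i32_cast(uint64_t len, int32_t minlen)
{
    int32_t narrowed = (int32_t)len;   /* 2**31 typically becomes -2**31 */
    return narrowed < minlen;          /* nonzero: "too short, cannot match" */
}

/* The same comparison done at full width. */
static int rejects_full_width(uint64_t len, int32_t minlen)
{
    assert(minlen >= 0);               /* sketch only covers a sane minlen */
    return len < (uint64_t)minlen;
}
```

With a length of 2**31 and a minlen of 1, the narrowed comparison reports "too short" while the full-width comparison does not.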

commit c4b48cc39cc42bd5230da33a3f258d0701590e5f
Author: Father Chrysostomos <[email protected]>
Date:   Wed Jul 24 13:22:54 2013 -0700

    Change mg_len >= 0 to mg_len != -1 for pos
    
    This allows 64, rather than 63 bits for storing the length on 64-bit
    systems (32 bits, rather than 31 on 32-bit).
    
    Some code paths already do this.  Use this for all of them.
    
    I cannot really test this, as it would require allocating too
    much memory.
    
    Also, the regexp engine itself cannot handle strings that long, at
    least not yet.

M       pp_hot.c
M       regexec.c
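
Editorial sketch of the sentinel change, using a 32-bit field so the effect is easy to see. The type and function names are hypothetical and are not the mg.h definitions. The assumption is that an unset pos is stored as -1 and that a stored value is read back as unsigned.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 32-bit position field; -1 means "pos not set". */
typedef int32_t demo_len;

static int pos_is_set_old(demo_len v) { return v >= 0; }   /* 31 usable bits */
static int pos_is_set_new(demo_len v) { return v != -1; }  /* all but one pattern */

/* Store an unsigned offset in the signed field. Out-of-range conversion is
 * implementation-defined; typical platforms keep the bit pattern. */
static demo_len store_pos(uint32_t offset)
{
    assert(offset != UINT32_MAX);   /* that bit pattern is the sentinel */
    return (demo_len)offset;
}

/* Reading back as unsigned is well defined and recovers the offset. */
static uint32_t load_pos(demo_len v) { return (uint32_t)v; }
```

An offset with the top bit set looks "unset" to the old test but is accepted by the new one, and it reads back unchanged.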

commit 2b2b539b8f588e2f5d55442a2c92946603eb5489
Author: Father Chrysostomos <[email protected]>
Date:   Tue Jul 23 13:15:34 2013 -0700

    Stop pos() from being confused by changing utf8ness
    
    The value of pos() is stored as a byte offset.  If it is stored on a
    tied variable or a reference (or glob), then the stringification could
    change, resulting in pos() now pointing to a different character off-
    set or pointing to the middle of a character:
    
    $ ./perl -Ilib -le '$x = bless [], chr 256; pos $x=1; bless $x, a; print pos $x'
    2
    $ ./perl -Ilib -le '$x = bless [], chr 256; pos $x=1; bless $x, "\x{1000}"; print pos $x'
    Malformed UTF-8 character (unexpected end of string) in match position at -e line 1.
    0
    
    So pos() should be stored as a character offset.
    
    The regular expression engine expects byte offsets always, so allow it
    to store bytes when possible (a pure non-magical string) but use char-
    acters otherwise.
    
    This does result in more complexity than I should like, but the alter-
    native (always storing a character offset) would slow down regular
    expressions, which is a big no-no.

M       dump.c
M       embed.fnc
M       embed.h
M       ext/Devel-Peek/t/Peek.t
M       inline.h
M       mg.c
M       mg.h
M       pp.c
M       pp_ctl.c
M       pp_hot.c
M       proto.h
M       regexec.c
M       regexp.h
M       sv.c
M       sv.h
M       t/op/pos.t
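
Editorial sketch of why a stored byte offset goes stale when the encoding of the string changes. The helper below is hypothetical and is not the core's conversion routine. It assumes a well-formed, NUL-terminated UTF-8 string.

```c
#include <assert.h>
#include <stddef.h>

/* Return the byte offset of character number `nchars` in the
 * NUL-terminated UTF-8 string `s` (assumed well formed).
 * Continuation bytes have the form 10xxxxxx. */
static size_t utf8_char_to_byte(const char *s, size_t nchars)
{
    size_t bytes = 0;
    assert(s != NULL);
    while (nchars > 0 && s[bytes] != '\0') {
        bytes++;                                        /* lead byte */
        while (((unsigned char)s[bytes] & 0xC0) == 0x80)
            bytes++;                                    /* continuation bytes */
        nchars--;
    }
    return bytes;
}
```

Character offset 1 is byte offset 2 in a string that starts with U+0100 (bytes C4 80), but byte offset 1 in a plain ASCII string. A byte offset saved against one stringification therefore names a different character, or the middle of one, against another.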

commit e9e2b913f131c853d4dc9cf4422b299d458131e0
Author: David Mitchell <[email protected]>
Date:   Sun Jul 21 11:57:22 2013 +0100

    regexec(): add quick-fail test for anchored \G
    
    under anchored \G, e.g. /ab\G/, we know that the start of the match must
    be at (ganch-gofs); so fail quickly if that's off the beginning of the
    string; or use it as the start point otherwise.

M       regexec.c

commit 72499cd457d8f2f9ae97fdf505b01ddc1ca83f9f
Author: David Mitchell <[email protected]>
Date:   Sun Jul 21 11:31:21 2013 +0100

    regexec: swap ganch setting and gofs offsetting
    
    These two blocks of code are currently independent of each other, but swap
    them round so that the calculated ganch value will be available for
    more clever gofs offset processing.

M       regexec.c

commit 5efa15b4d6a534e4a65e70a33510b9664c051a25
Author: David Mitchell <[email protected]>
Date:   Sat Jul 20 16:16:10 2013 +0100

    fix COW match capture optimisation
    
    When an SV matches, a new SV is created which is a COW copy of the original
    SV, and stored in prog->saved_copy, then prog->subbeg is set to point to
    the (shared) PVX buffer.
    
    Earlier in this branch I introduced an optimisation that skipped freeing
    the old saved_copy and creating a new COW SV each time, if the old
    saved_copy SV was already a shared copy of the SV being matched.
    
    So far so good, except that I implemented it wrongly: if non-COW
    matches (which malloc() subbeg) are interspersed with COW matches,
    then the subbeg of the COW and the malloced subbeg get mixed up and
    AddressSanitizer throws a wobbly.
    
    The fix is simple: in the optimised branch, we still need to free subbeg
    if RXp_MATCH_COPIED is true, then reassign it.

M       regexec.c
M       t/re/pat.t
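
Editorial sketch of the ownership rule this fix restores. The struct and functions are hypothetical stand-ins, not the regexp struct or the code in regexec.c. The assumption is that subbeg is either a malloc()ed private copy, flagged as copied, or a pointer into a shared buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct demo_prog {
    char *subbeg;            /* capture buffer */
    int match_copied;        /* nonzero: subbeg was malloc()ed and is owned */
    const char *saved_copy;  /* shared buffer currently referenced, or NULL */
};

/* Non-COW capture: take a private copy of the string. */
static void demo_capture_copy(struct demo_prog *p, const char *str)
{
    size_t n = strlen(str) + 1;
    char *buf = malloc(n);
    assert(buf != NULL);
    memcpy(buf, str, n);
    if (p->match_copied)
        free(p->subbeg);
    p->subbeg = buf;
    p->match_copied = 1;
}

/* COW capture: point subbeg at the shared buffer. */
static void demo_capture_shared(struct demo_prog *p, char *shared)
{
    if (p->saved_copy != shared)
        p->saved_copy = shared;   /* slow path: would create a new shared SV */
    /* On both paths, including the optimised one where saved_copy already
     * matches, an owned subbeg must be freed before it is reassigned. */
    if (p->match_copied) {
        free(p->subbeg);
        p->match_copied = 0;
    }
    p->subbeg = shared;
}
```

Interleaving a shared capture, a private copy, and another shared capture leaves subbeg pointing at the shared buffer with nothing leaked or dangling.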

commit 0ceb94311518f555033ab8a2677f0f80f91801b4
Author: David Mitchell <[email protected]>
Date:   Fri Jul 19 22:00:23 2013 +0100

    regexec(): avoid uninit use of var
    
    clang pointed out that
    
        if (...)
        goto phooey;
        oldsave = PL_savestack_ix;
        ...
      phooey:
        LEAVE_SCOPE(oldsave);
    
    could use oldsave uninitialised. clang 1, dave 0.

M       regexec.c
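
Editorial sketch of one way to avoid the pattern clang flagged: capture the value before any jump to the cleanup label. The commit message does not say how the real fix was done, so this is an assumption. All names are hypothetical stand-ins for PL_savestack_ix and LEAVE_SCOPE.

```c
#include <assert.h>

static int demo_savestack_ix = 7;   /* stand-in for PL_savestack_ix */
static int demo_left_at = -1;       /* records what the cleanup step received */

static void demo_leave_scope(int ix) { demo_left_at = ix; }

static int demo_exec(int fail_early)
{
    /* Capture the value before any jump to the cleanup label. */
    const int oldsave = demo_savestack_ix;
    int ok = 0;
    assert(oldsave >= 0);

    if (fail_early)
        goto phooey;
    demo_savestack_ix += 3;   /* pretend the match pushed savestack entries */
    ok = 1;

  phooey:
    demo_leave_scope(oldsave);      /* oldsave is initialised on every path */
    demo_savestack_ix = oldsave;
    return ok;
}
```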

commit a76dc3648eaf93c41d2ef5aa894f91c35c590abc
Author: David Mitchell <[email protected]>
Date:   Fri Jul 19 21:44:32 2013 +0100

    fix /test_bootstrap.t under -DPERL_NO_COW
    
    These tests check whether "require test.pl" inadvertently uses $& etc.
    They do that by doing a simple pattern match "Perl" =~ /Perl/, then
    checking that eval '$&' returns undef.
    
    This has always been a dodgy thing to rely on. It turns out that under
    5.18.0, whether eval '$&' was undef depended on whether the intuit-only-
    match codepath was taken. So:
    
    "Perl" =~ /Perl/;   eval '$&';   # intuit-only match: returned undef
    "Perl" =~ /\w+\w*/; eval '$&';   # regexec() match: returned 'Perl'.
    
    In this branch, the same code path is now used for both intuit() and
    regexec() matches, so both return 'Perl'.
    
    So, abandon this approach to the test, and instead read in test.pl and
    grep for the text '$&' etc.
    Requires a minor fixup to test.pl to avoid a false positive.

M       t/porting/test_bootstrap.t
M       t/test.pl

commit c4d6fd6f16d3ddcdde119935877a8251c20ec144
Author: David Mitchell <[email protected]>
Date:   Fri Jul 19 20:07:56 2013 +0100

    fix build under -DPERL_NO_COW
    
    An earlier commit in this branch fixed up capturing on intuit-only
    matches.
    However, the new code grabbed the buffer before setting offs[0].start,
    offs[0].end.  Under old-style non-COW, it uses offs[0].start and end to
    determine what subset of the buffer to capture. So set them first!

M       regexec.c

commit bdf91c4cfc7d0bc5bbdae498fdb95c446942f3ec
Author: David Mitchell <[email protected]>
Date:   Fri Jul 19 18:12:53 2013 +0100

    regexec(): access extflags directly
    
    Some bits of code that had been moved from pp_match() etc into
    regexec() still used the external API to access flags, i.e.
    
        RX_EXTFLAGS(rx)
    
    Replace those uses with the more direct
    
        prog->extflags
    
    for consistency with the rest of the code.

M       regexec.c

commit 0b932002435618399cb2fe44eca93883d1c2f390
Author: David Mitchell <[email protected]>
Date:   Fri Jul 19 17:44:34 2013 +0100

    regexec(): tidy up ganch-setting code
    
    It's a bit verbose with tons of debugging statements. Hard to see the wood
    for the trees.

M       regexec.c

commit 15fd9450cf140f8195348a30ad8fc830466f48bb
Author: David Mitchell <[email protected]>
Date:   Fri Jul 19 17:29:01 2013 +0100

    regexec(): merge the 2 RXf_GPOS_SEEN setup blocks
    
    Change
    
        if (RXf_GPOS_SEEN) {
            ... adjust startpos ...
        }
        ...
        if (RXf_GPOS_SEEN) {
            ... calculate ganch ...
        }
    
    to
    
        if (RXf_GPOS_SEEN) {
            ... adjust startpos ...
            ... calculate ganch ...
        }
        ...
    
    Should contain no functional changes.
    
    With this commit (building on many previous ones in this branch), all the
    setup for \G is now in one place in regexec(), rather than scattered
    across various places in pp_match(), regexec() etc.

M       regexec.c

commit b371a65162678eca951c726890eeeaa595b63748
Author: David Mitchell <[email protected]>
Date:   Fri Jul 19 17:10:04 2013 +0100

    regexec(): simplify RXf_ANCH_GPOS pos calc
    
    There are two bits of code in regexec() that do special handling for
    RXf_ANCH_GPOS:
    
    First, after setting ganch from pos(), it does a couple of quick-fail
    checks:
        fail if s > ganch
        fail if (ganch - gofs) < strbeg
    at this point it also updates s to be ganch - gofs, although confusingly,
    s is not subsequently used.
    
    Second, when about to call regtry, it calculates a new start value
    (ignoring the old one, s):
    
        tmp_s = ganch - gofs;
    
    then checks:
        fail if tmp_s < strbeg
    
    As can be seen, these two sets of tests essentially partially duplicate
    each other.
    
    This commit moves all the work to the second block of code, which
    simplifies things, and makes the first block of code purely about
    calculating ganch.
    
    Note that the new condition added by this commit in the second block,
    
        fail if s > tmp_s (i.e. if s > (ganch - gofs))
    
    subsumes both previous conditions, since
    a) it is stronger than s > ganch
    b) s will be >= strbeg, so tmp_s >= strbeg

M       regexec.c
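
Editorial sketch that checks the subsumption argument above over a small grid of values. Positions are modelled as plain integer offsets rather than pointers. The function names are hypothetical, and the stated preconditions (s >= strbeg, gofs >= 0) are taken from the commit message.

```c
#include <assert.h>

/* The two quick-fail tests from the first block of code. */
static int old_checks_fail(int strbeg, int s, int ganch, int gofs)
{
    return s > ganch || (ganch - gofs) < strbeg;
}

/* The single test added to the second block. */
static int new_check_fails(int s, int ganch, int gofs)
{
    int tmp_s = ganch - gofs;
    return s > tmp_s;
}

/* Returns 1 if every case rejected by the old tests is also rejected by
 * the new one, for all offsets in 0..limit with s >= strbeg, gofs >= 0. */
static int new_check_subsumes_old(int limit)
{
    int strbeg, s, ganch, gofs;
    assert(limit > 0);
    for (strbeg = 0; strbeg <= limit; strbeg++)
        for (s = strbeg; s <= limit; s++)
            for (ganch = 0; ganch <= limit; ganch++)
                for (gofs = 0; gofs <= limit; gofs++)
                    if (old_checks_fail(strbeg, s, ganch, gofs)
                        && !new_check_fails(s, ganch, gofs))
                        return 0;
    return 1;
}
```

The new test is also strictly stronger: with strbeg 0, s 3, ganch 5 and gofs 3, both old tests pass but the new one fails.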

commit 1f59f4aba86381272b1f5f47bdb60302c827e19a
Author: David Mitchell <[email protected]>
Date:   Fri Jul 19 16:37:04 2013 +0100

    regexec(): use regtry(&s) not regtry(&startpos)
    
    regexec() has several cases such as anchored, floating etc, and for each
    of these it will call regtry() one or more times to match at likely
    starting positions. In all cases but one, it calls regtry(&s), where
    s is the current start position that moves as we try new positions. In
    the one final case it uses startpos instead, which is supposed to be static.
    The actual code looks like:
    
        if (s == startpos && regtry(&startpos))
    
    which might seem harmless, but under things like (*COMMIT), regtry can
    update the pointer value (which is why its address is taken). So in this
    (obscure) case, the wrong pointer gets updated.

M       regexec.c
M       t/re/pat_advanced.t
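
Editorial sketch of why it matters which variable's address is passed. demo_try() is a hypothetical stand-in for regtry() and simply advances the pointer it is given, as the commit message says regtry can do under things like (*COMMIT).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for regtry(): on success it may move *startp
 * forward by `skip` bytes. */
static int demo_try(const char **startp, size_t skip)
{
    assert(startp != NULL && *startp != NULL);
    *startp += skip;
    return 1;
}

/* Returns how far the moving cursor s ended up from the start of str.
 * skip must not exceed the length of str. */
static size_t demo_cursor_advance(const char *str, size_t skip, int pass_cursor)
{
    const char *startpos = str;   /* meant to stay fixed */
    const char *s = str;          /* moving cursor */

    if (pass_cursor)
        demo_try(&s, skip);         /* the cursor sees the update */
    else
        demo_try(&startpos, skip);  /* the update lands on the wrong variable */
    return (size_t)(s - str);
}
```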

commit ce632ead2b2251d4ac8c1bcfb1ca87859372ada4
Author: David Mitchell <[email protected]>
Date:   Tue Jul 16 16:31:04 2013 +0100

    s/.(?=.\G)/X/g: refuse to go backwards
    
    On something like:
    
        $_ = "123456789";
        pos = 6;
        s/.(?=.\G)/X/g;
    
    each iteration could in theory start with pos one character to the left
    of the previous position, and with the substitution replacing bits that
    it has already replaced.  Since that way madness lies, ban any attempt by
    s/// to substitute to the left of a previous position.
    
    To implement this, add a new flag to regexec(), REXEC_FAIL_ON_UNDERFLOW.
    This tells regexec() to return failure even if the match itself succeeded,
    but where the start of $& is before the passed stringarg point.
    
    This change caused one existing test to fail (which was added about a year
    ago):
    
        $_="abcdef";
        s/bc|(.)\G(.)/$1 ? "[$1-$2]" : "XX"/ge;
        print; # used to print "aXX[c-d][d-e][e-f]"; now prints "aXXdef"
    
    I think that that test relies on ambiguous behaviour, and that my change
    makes things saner.
    
    Note that s/// with \G is generally very under-tested.

M       pod/perlre.pod
M       pp_ctl.c
M       pp_hot.c
M       regexec.c
M       regexp.h
M       t/re/subst.t
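
Editorial sketch of the acceptance rule described for REXEC_FAIL_ON_UNDERFLOW. The flag value and function below are hypothetical and are not taken from regexp.h or regexec.c. Positions are modelled as offsets into the same string.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_FAIL_ON_UNDERFLOW 0x1u   /* hypothetical value, not the real bit */

/* Decide whether a match that has already succeeded should be reported.
 * match_start is where $& begins; stringarg is the caller's start point. */
static int demo_accept_match(size_t match_start, size_t stringarg,
                             unsigned flags)
{
    assert((flags & ~DEMO_FAIL_ON_UNDERFLOW) == 0);  /* only one demo flag */
    if ((flags & DEMO_FAIL_ON_UNDERFLOW) && match_start < stringarg)
        return 0;   /* match begins to the left of the start point */
    return 1;
}
```

A match starting at offset 4 with a start point of 6 is reported without the flag and rejected with it. A match starting exactly at the start point is reported either way.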

commit b142539b8574c0028a76b9c8093282125ddf9a41
Author: David Mitchell <[email protected]>
Date:   Mon Jul 15 21:57:34 2013 +0100

    pp_subst: don't use REXEC_COPY_STR on 2nd match
    
    pp_subst() sets the REXEC_COPY_STR flag on the first match. On the second
    and subsequent matches, it doesn't set it in two out of three of the branches
    (including pp_substcont) where it calls CALLREGEXEC().
    The one place where it *does* set it is a (harmless) mistake, since regexec
    ignores REXEC_COPY_STR if REXEC_NOT_FIRST is set (which it is, on all 3
    branches).
    
    So unset REXEC_COPY_STR in the third branch too, for consistency.

M       pp_hot.c

commit 7a8a3f579a18be445bbe0eee0def2f1df59b043c
Author: David Mitchell <[email protected]>
Date:   Mon Jul 15 21:24:02 2013 +0100

    pp_subst: combine 3 small elsif blocks into 1
    
    and slightly reduce the scope of the temporary i var.

M       pp_hot.c

commit 52c1af7a1e0360a8f7eb0129faf1e6b5d68d95d3
Author: David Mitchell <[email protected]>
Date:   Mon Jul 15 21:10:47 2013 +0100

    pp_subst: remove one use of 'm' local var

M       pp_hot.c

commit d709c126a9282cc02e360f598dbeb79ca667917f
Author: David Mitchell <[email protected]>
Date:   Mon Jul 15 21:00:49 2013 +0100

    pp_subst: reduce scope of 'i' variable
    
    it's just used as a temporary var in a few blocks; declare it individually
    in each block rather than being scoped to the whole function.

M       pp_hot.c

commit 9d1e8f7f26d0d3909059d403714e989e74613dbc
Author: David Mitchell <[email protected]>
Date:   Mon Jul 15 20:37:44 2013 +0100

    pp_subst: reduce scope of 'm' var
    
    it's mainly just a temporary local var; declare it individually within each
    scope that makes use of it.

M       pp_hot.c

commit a7c17f525bf5aa7a82eb6aab070758344474de1d
Author: David Mitchell <[email protected]>
Date:   Mon Jul 15 20:17:51 2013 +0100

    pp_subst: set/use s,m vars near where they're used
    
    This should be just a cosmetic change; but basically change stuff like
    
        m = orig;
        s = foo();
        ... lots of lines not using s or m ...
        bar(m,s)
        ... more stuff using s ...
    
    to
    
        ... lots of lines not using s or m ...
        s = foo();
        bar(orig,s)
        ... more stuff using s ...
    
    This is part of a few commits to generally clean up the scope and
    comprehensibility of the vars within pp_subst

M       pp_hot.c

commit ff7d5b3c3758fc69daa282af4c3f0498ffd5421b
Author: David Mitchell <[email protected]>
Date:   Mon Jul 15 19:54:53 2013 +0100

    pp_subst: reduce scope of 'd' variable
    
    It's just used as a temporary value in two branches;
    so make it a local var in each of those branches.

M       pp_hot.c

commit bda39f9d93ab0c7c8829c22cbd76bd06f55b3a89
Author: David Mitchell <[email protected]>
Date:   Mon Jul 15 19:16:10 2013 +0100

    pp_subst: cosmetic re-arrangement of vars
    
    since 'orig' always points to the start of the string, while 's' varies,
    change
    
        s = SvPV_nomg(...);
        ...other stuff using value of s ...
        orig = s
        ...
    
    to
    
        orig = SvPV_nomg(...);
        ...other stuff using value of orig ...
        s = orig
        ...
    
    No functional change, just reduces the cognitive load slightly
    
    also adds some comments as to what force_on_match is about.

M       pp_hot.c

commit 7c7929a1e812b6e9737c918e668612aec0bc4ad0
Author: David Mitchell <[email protected]>
Date:   Sat Jul 13 21:18:50 2013 +0100

    regexec(): fix ganch and till settings
    
    Since startpos is now the \G-adjusted start position, use the real start
    position instead (stringarg) when setting reginfo->till, and when setting
    ganch in the non-pos case.
    
    This stops this infinitely looping:
    
        $_ = "x"; pos = 1; @a = /x\G/g

M       regexec.c
M       t/re/pat.t

commit d9c9ab65e37f5873261231df22d7a2b8ba3b6786
Author: David Mitchell <[email protected]>
Date:   Sat Jul 13 20:16:19 2013 +0100

    regexec(): skip second intuit() call
    
    A few commits ago, the call to intuit() done by the *callers* of
    regexec() was moved into regexec() itself. Since regexec() could also call
    intuit(), this temporarily led to the situation where intuit() was
    harmlessly but inefficiently called twice. The last few commits have
    removed the subtle differences between the conditions for each of the two
    call points, so the second call to intuit() can now be removed.
    
    A consequence of this is that we have to adjust the usage of the
    'startpos' versus 's' variables; the original intent was that
    startpos was constant, while s moved forward in the string after intuit
    etc. This got a bit lost during the recent reorganisation, but is now
    re-established. (startpos isn't quite constant: it will contain any
    initial adjustment for \G.)

M       regexec.c

commit 4072a57e3badb20c66530085e34addfbdcbb3c57
Author: David Mitchell <[email protected]>
Date:   Fri Jul 19 02:08:56 2013 +0100

    fix intuit_start() with \G
    
    Intuit assumed that any anchor, including \G, anchored at BOS or after \n.
    This obviously isn't the case for \G, so exclude RXf_ANCH_GPOS from the
    RXf_ANCH branch.
    
    This has never been spotted before, since intuit used to be skipped when
    \G was present.

M       regexec.c
M       t/re/pat.t

commit bc22e53bcc667a53b7279855d400fd1cc28eb855
Author: David Mitchell <[email protected]>
Date:   Sat Jul 13 15:23:59 2013 +0100

    enable intuit under anchored \G, and fix a bug
    
    Since 1999, regcomp has had approximately the following comment and code:
    
        /* XXXX Currently intuiting is not compatible with ANCH_GPOS.
           This should be changed ASAP!  */
        if ((r->check_substr || r->check_utf8) && !(r->extflags & RXf_ANCH_GPOS)) {
            r->extflags |= RXf_USE_INTUIT;
            ....
    
    However, it appears that since that time, intuit has had (at least some)
    support for anchored \G added.
    Note also that the RXf_USE_INTUIT flag (up until a few commits ago)
    was only used by *callers* of regexec() to decide whether to call intuit()
    first; regexec() itself also internally calls intuit() on occasion, and in
    those cases it directly checks just the check_substr and check_utf8 fields,
    rather than the RXf_USE_INTUIT flag; so in those cases it's using intuit
    even in the presence of anchored \G.
    
    So, in the grand perl tradition of "make the change and see if anything
    in the test suite breaks", that's what I've done for this commit
    (i.e. removed the RXf_ANCH_GPOS check above).
    
    So intuit is now normally called even in the presence of anchored \G.
    This means that something like "aaaa" =~ /\G.*xx/ will now quickly fail in
    intuit rather than more slowly failing in regmatch().
    
    Note that I have no actual knowledge of whether intuit is *really*
    anchored-\G-safe.
    
    As it happens one thing in the test suite did break, and this was due to
    the following code, added back in 1997:
    
        if (
            ....
            && !((RExC_seen & REG_SEEN_GPOS) || (r->extflags & RXf_ANCH_GPOS)))
       )
           r->extflags |= RXf_CHECK_ALL;
    
    It was clearly meant to say that if either of those \G flags were present,
    don't set the RXf_CHECK_ALL flag (which enables intuit-only matches).
    But the '!' was set to cover the first condition only, rather than both.
    Presumably this had never been spotted before due to skipping intuit under
    anchored \G.
    
    [Actually this commit broke some other stuff too, not covered by the test
    suite.  See the next commit. Hooray for git rebase -i and history
    re-writing!]

M       regcomp.c
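
Editorial sketch of the negation-scope problem described above, reduced to the two \G terms. It reads the description as "the '!' applied to the first term only", which is an interpretation of the commit message and not a copy of the regcomp.c condition. The function names are hypothetical.

```c
#include <assert.h>

/* '!' covers only the first term: the result is true whenever the
 * second flag is set, which is the opposite of what was intended. */
static int check_all_buggy(int seen_gpos, int anch_gpos)
{
    assert((seen_gpos == 0 || seen_gpos == 1)
           && (anch_gpos == 0 || anch_gpos == 1));
    return !seen_gpos || anch_gpos;
}

/* '!' covers both terms: true only when neither \G flag is present. */
static int check_all_intended(int seen_gpos, int anch_gpos)
{
    assert((seen_gpos == 0 || seen_gpos == 1)
           && (anch_gpos == 0 || anch_gpos == 1));
    return !(seen_gpos || anch_gpos);
}
```

The two forms agree when neither flag is set and disagree when only the anchored \G flag is set, the case that intuit used to skip.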

commit ff907a985cb5f7d679be01146e8cf4f34912f15d
Author: David Mitchell <[email protected]>
Date:   Wed Jul 10 20:00:22 2013 +0100

    regexec_flags(): remove vestigial scream support
    
    intuit has an arg (data) that used to be used for scream stuff, but which
    is now unused. However, Perl_regexec_flags() still went to the trouble of
    setting up that parameter when calling intuit. So stop doing that.

M       regexec.c

commit 8a03003a7ae8882b07e5b45bc9bd894a0ff34ee5
Author: David Mitchell <[email protected]>
Date:   Wed Jul 10 14:28:02 2013 +0100

    regexec_flags(): keep stringarg constant
    
    stringarg is the arg passed to Perl_regexec_flags() to indicate where to
    start matching. Currently the code adjusts this under \G, then copies it
    to startpos, then later tinkers with startpos further.
    
    Change it so that stringarg is never changed, and all the adjusting is to
    startpos. Shouldn't make any logical difference, but makes the code
    slightly cleaner and easier to understand (and doesn't require minend to
    be adjusted any more).

M       regexec.c

commit d7414bad414ec67a604efc862b7f7aeacbe2608a
Author: David Mitchell <[email protected]>
Date:   Wed Jul 10 13:35:51 2013 +0100

    regexec_flags(): use result of intuit_start()
    
    When I moved the call to re_intuit_start() into Perl_regexec_flags()
    a few commits earlier, I assigned the return value to the wrong variable,
    so a subsequent match would still start at the beginning, not at the
    intuited start point. This commit corrects that, by updating startpos
    rather than stringarg.

M       regexec.c

commit 64cc36b94f4c6af62746bf86b2781349100a6f64
Author: David Mitchell <[email protected]>
Date:   Wed Jul 10 11:13:38 2013 +0100

    pp_match: simplify pos()-getting code
    
    The previous commit removed the \G handling from pp_match; most of what's
    left in that code block is redundant code that just sets curpos under all
    conditions. So tidy it up.

M       pp_hot.c

commit 802b90a875fdc4e9d84fe3744c4dab43dcd76fc4
Author: David Mitchell <[email protected]>
Date:   Sun Jun 23 13:30:59 2013 +0100

    regexec: handle \G ourself, rather than in callers
    
    Normally a /g match starts its processing at the previous pos() (or at
    char 0 if pos is not set); however in the case of something like /abc\G/
    we actually need to start 3 characters before pos. This has been handled
    by the *callers* of regexec() subtracting prog->gofs from the stringarg
    arg before calling it, or by setting stringarg to strbeg for floating,
    such as /\w+\G/.
    
    This is clearly wrong: the callers of regexec() shouldn't need to worry
    about the details of getting \G right: move this code into regexec()
    itself.
    
    (Note that although this commit passes all tests, it quite possibly isn't
    logically correct. It will get fixed up further during the next few
    commits)

M       pp_ctl.c
M       pp_hot.c
M       regexec.c
M       regexp.h

commit 8543d7c643a55556db1cebb9f4cb15d21442d7d5
Author: Yves Orton <[email protected]>
Date:   Sun Sep 16 14:25:02 2012 +0200

    fix 114884 positive GPOS lookbehind regex substitution failure
    
    This also corrects a test added in 2c2969659ae1c534e7f3fac9e7a7d186defd9943
    which was arguably wrong. The details of \G are a bit fuzzy, and IMO it's a
    little hard to say exactly what is right, although it generally is clear
    what is wrong.

M       pp_ctl.c
M       t/re/subst.t

commit fa282eba82dcc6c70e1d753852f3cdb2266ee823
Author: David Mitchell <[email protected]>
Date:   Sat Jun 22 17:24:13 2013 +0100

    pp_match(): don't set REXEC_IGNOREPOS on 1st iter
    
    Currently all core callers of regexec set both the
    REXEC_IGNOREPOS and REXEC_NOT_FIRST flags, or neither, depending
    on whether this is the first or subsequent iteration of a //g;
    *except* for one place in pp_match(), where REXEC_IGNOREPOS is set
    on the first iteration for the one specific case of /g with an anchored
    \G.
    
    Now AFAICT this makes no difference, because the starting position
    as calculated by regexec() still comes to the same value of
    (strbeg + pos - gofs), and the same value of ganch calculated.
    
    Also in the commit that added this particular use of the flag to pp_match,
    (0ef3e39ecdfec), removing the flag makes no difference to the passing or
    not of the new test case.
    
    So I don't understand what its purpose is, and it's possibly a mistake.
    Removing it now makes the code simpler for further clearup.

M       pp_hot.c

commit e57a613c084b5ce7e38bd0b9a6fb41d329273028
Author: David Mitchell <[email protected]>
Date:   Fri Jun 21 21:44:45 2013 +0100

    pp_match(): stop setting $-[0] before regexec()
    
    It doesn't actually achieve anything.

M       pp_hot.c

commit 773cf7ce8f9605b00493fe65e5cd89141b786487
Author: David Mitchell <[email protected]>
Date:   Fri Jun 21 20:16:30 2013 +0100

    pp_match: avoid setting $+[0]
    
    This function sometimes set $+[0] to pos() before calling regexec().
    This value isn't used by regexec(), and was really just a way of updating
    the new start position for //g. Replace it with a local var instead.

M       pp_hot.c

commit 260faa9b8d1032a39dc192362a574cd1777492ea
Author: David Mitchell <[email protected]>
Date:   Fri Jun 21 20:00:01 2013 +0100

    pp_match(): eliminate unused t variable
    
    and restrict usage of s variable

M       pp_hot.c

commit 71bc8564fa2079e6824ff5875067f07e313a8d54
Author: David Mitchell <[email protected]>
Date:   Thu Jun 20 14:54:44 2013 +0100

    pp_match(): skip passing gpos arg to regexec()
    
    In one specific case, pp_match() passes the value of pos() to regexec()
    via the otherwise unused 'data' arg.
    
    It turns out that pp_match() only passes this value when it exists and is
    >= 0, while regexec() only uses it when there's no pos magic or pos() < 0.
    
    So it's never used as far as I can tell.
    
    So, strip it for now.

M       pp_hot.c
M       regexec.c

commit 31cd5d1cf31e545a135b71465f163ce0413a5834
Author: David Mitchell <[email protected]>
Date:   Thu Jun 20 14:22:42 2013 +0100

    add some basic floating /\G/ tests
    
    Floating is when the \G is an unknown number of characters from the start
    of the pattern, such as /a+\G/. Surprisingly, there were no tests for this
    form.
    
    Here are a few basic tests just to exercise the main code paths. More
    comprehensive tests could do with being added at some point.

M       t/re/pat.t

commit 2986c433c51cfea4b11e524475cc3526450ba5c5
Author: David Mitchell <[email protected]>
Date:   Thu Jun 20 13:33:31 2013 +0100

    fix /.\G/ under threading
    
    When a regex was being duped, its (constant) gofs field wasn't being
    copied, but rather was being set to zero. Skip this and lots of TODO tests
    pass.

M       regcomp.c
M       t/re/pat.t

commit 631daf9463efb3dda38e67ea0dc6061744cdddc5
Author: David Mitchell <[email protected]>
Date:   Wed Jun 19 12:44:41 2013 +0100

    skip creating new capture COW SV if possible
    
    Each time we do a match, we currently (where possible) make a COW copy of
    the just-matched string. This involves creating a new SV that shares the
    same PVX buffer with the string. In a repeated match like while (/.../g),
    that means that each time round we free the old capture SV and create a new
    one.
    
    As an optimisation, skip the free/create if the old capture SV is already
    a COW clone of the match string.

M       regexec.c

commit 0d2eef90cf445eee4c95ff55716b705b89675ae9
Author: David Mitchell <[email protected]>
Date:   Tue Jun 18 16:34:43 2013 +0100

    make Perl_reg_set_capture_string static
    
    This function was introduced a few commits ago. Since it's now only
    called from within regexec.c, make it static.

M       embed.fnc
M       embed.h
M       proto.h
M       regexec.c

commit feece1731aaf4eb7d1a4a713852c95c918cfb84b
Author: David Mitchell <[email protected]>
Date:   Tue Jun 18 16:17:39 2013 +0100

    add intuit-only match to s///
    
    pp_match() has an intuit-only match mode: if intuit_start() succeeds and
    the regex is marked as only needing intuit (RXf_CHECK_ALL), then calling
    regexec() is skipped; pp_match() just sets $& and then returns.
    
    The commit which originally added that feature to pp_match() also added a
    comment to pp_subst() suggesting that the same thing could be done there.
    
    This commit finally achieves that. It builds on the previous commit (which
    moved this mechanism from pp_match() directly into regexec()), skipping
    calling intuit_start() and directly calling regexec() with the
    REXEC_CHECKED flag not set.
    
    This appears to reduce the execution time of a simple substitution
    like s/abc/def/ by a fifth.

M       pp_hot.c

commit ff470c8f375210eba1a3b0d7790de7f3591126ee
Author: David Mitchell <[email protected]>
Date:   Tue Jun 18 14:44:12 2013 +0100

    move intuit call from pp_match() into regexec()
    
    Currently the main part of pp_match() looks like:
    
        if (can_use_intuit) {
            if (!intuit_start())
                goto nope;
            if (can_match_based_only_on_intuit_result) {
                ... set up $&, $-[0] etc ...
                goto gotcha;
            }
        }
        if (!regexec(..., REXEC_CHECKED|r_flags))
            goto nope;
    
      gotcha:
        ...
    
    This rather breaks the regex API encapsulation. The caller of the regex
    engine shouldn't have to worry about whether to call intuit() or
    regexec(), and to know to set $& in the intuit-only case.
    
    So, move all the intuit-calling and $& setting into regexec itself.
    This is cleaner, and will also shortly allow us to enable intuit-only
    matches in pp_subst() too. After this change, the code above looks like
    (in its entirety):
    
        if (!regexec(..., r_flags))
            goto nope;
    
        ...
    
    There, isn't that nicer?

M       pp_hot.c
M       regexec.c

commit e6c8c2443d8b8dbf9a985bf5bac3f657595cb44e
Author: David Mitchell <[email protected]>
Date:   Tue Jun 18 12:29:16 2013 +0100

    make intuit_start() handle mixed utf8-ness
    
    Fix a bug in intuit_start() that makes it fail when the utf8-ness of the
    string and pattern differ. This was mostly masked, since pp_match() skips
    calling intuit in this case (and has done since 2000, presumably as a
    workaround for this issue, and possibly for other issues since fixed).
    But pp_subst() didn't skip, so code like this would fail:
    
        $c = "\x{c0}";
        utf8::upgrade($c);
        print "ok\n" if $c =~ s/\xC0{1,2}$/\xC0/i;
    
    Now that intuit is (hopefully) fixed, also remove the guard in pp_match().

M       pp_hot.c
M       regexec.c

commit de2f7dcf0030f7c69270f4d7ce50f55afa3ba689
Author: David Mitchell <[email protected]>
Date:   Mon Jun 17 17:38:41 2013 +0100

    pp_match(): fix UTF* match setting
    
    A recent commit did RX_MATCH_UTF8_set() based on the utf8-ness of the
    pattern rather than the match string. It didn't matter because in that
    branch they were guaranteed to have the same value, but fix it anyway,
    both for correctness' sake, and because it *will* matter shortly.
M       pp_hot.c

commit 97d4c95a0d9ceab30d7bc3158c7cf3ab72df09af
Author: David Mitchell <[email protected]>
Date:   Sun Jun 16 16:54:09 2013 +0100

    pp_match(): intuit can handle refs these days
    
    It looks like we no longer need to skip intuit-only matching when the
    match is a ref or overloaded (e.g. $ref =~ /ARRAY/)

M       pp_hot.c

commit 6ac0338085cb31194c9eddce9c96701cdfe4b842
Author: David Mitchell <[email protected]>
Date:   Sun Jun 16 16:09:07 2013 +0100

    pp_match(): remove ret_no label
    
    The nope: and ret_no: labels labelled the same point in the code.
    Eliminate one of them.

M       pp_hot.c

commit 525acc3719327664cf5b77719b106388f9f45082
Author: David Mitchell <[email protected]>
Date:   Sun Jun 16 16:01:22 2013 +0100

    pp_match(): combine intuit and regexec branches
    
    There was some code that looked roughly like:
    
        if (can_match_on_intuit_only) {
            ....
            goto yup;
        }
        if (!regexec())
            goto ret_no;
    
      gotcha:
        A; B;
        if (simple)
            RETURNYES;
        X; Y;
        RETURN;
    
      yup:
        A;
        if (!simple)
            goto gotcha;
        B;
        RETURNYES;
    
    Refactor it to look like
    
        if (can_match_on_intuit_only) {
            ....
            goto gotcha;
        }
        if (!regexec())
            goto ret_no;
    
      gotcha:
        A; B;
        if (simple)
            RETURNYES;
        X; Y;
        RETURN;
    
    As well as simplifying the code, it also avoids duplicating some work
    (the 'A' above was done twice sometimes) - harmless but less efficient.
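
    The equivalence of the two shapes can be checked with a small model.
    The C sketch below is only an illustration: A, B, X and Y are reduced
    to counters, the parameter names and return codes are hypothetical,
    and nothing here is taken from pp_hot.c.

```c
#include <assert.h>

/* Call counters for the abstract steps A, B, X, Y from the message above. */
typedef struct { int a, b, x, y; } steps;

enum { RET_NO, RET_YES, RET_LIST };  /* ret_no, RETURNYES, RETURN */

/* Control flow before the refactor: separate `yup` and `gotcha` paths. */
static int old_flow(int intuit_only, int regexec_ok, int simple, steps *s)
{
    assert(s);
    if (intuit_only)
        goto yup;
    if (!regexec_ok)
        return RET_NO;

  gotcha:
    s->a++; s->b++;
    if (simple)
        return RET_YES;
    s->x++; s->y++;
    return RET_LIST;

  yup:
    s->a++;
    if (!simple)
        goto gotcha;        /* A runs a second time on this path */
    s->b++;
    return RET_YES;
}

/* Control flow after the refactor: both branches share one tail. */
static int new_flow(int intuit_only, int regexec_ok, int simple, steps *s)
{
    assert(s);
    if (!intuit_only && !regexec_ok)
        return RET_NO;
    s->a++; s->b++;
    if (simple)
        return RET_YES;
    s->x++; s->y++;
    return RET_LIST;
}
```

    Running both functions over all eight input combinations gives the
    same return code and the same B, X and Y counts; the A count differs
    only for an intuit-only match that is not "simple", where the old
    flow runs A twice.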

M       pp_hot.c

commit 5d661065f47e48f488cb0020e5045ffddf6ebd83
Author: David Mitchell <[email protected]>
Date:   Sun Jun 16 15:45:20 2013 +0100

    pp_match(): refactor intuit-only code
    
    change
    
        if (intuit_only)
            goto yup;
        ...
      yup:
        A; B; X; Y;
    
    to
    
        if (intuit_only) {
            A; B;
            goto yup;
        }
        ...
      yup:
        X; Y;
    
    where A and B are intuit_only-specific steps while X and Y are done by the
    regexec() branch too. This will shortly allow us to merge the two
    branches.

M       pp_hot.c

commit 1f2f7429aff3e259852fc9056fe2ba89f773c754
Author: David Mitchell <[email protected]>
Date:   Sun Jun 16 15:38:56 2013 +0100

    pp_match(): minor refactor: consolidate RETPUSHYES
    
    Make the code slightly simpler by doing an early RETPUSHYES after success
    where possible.

M       pp_hot.c

commit d0e21621efe7391ad3eb971d8d0a7ab8c13be89b
Author: David Mitchell <[email protected]>
Date:   Sun Jun 16 14:27:19 2013 +0100

    pp_match(): factor out some common code
    
    Some identical code is used in two separate branches to set pos()
    after a successful match. Hoist the common code to above the branch.

M       pp_hot.c

commit f59df168861df28247c14ec8a8f2cd4f2de7e1db
Author: David Mitchell <[email protected]>
Date:   Sun Jun 16 13:26:30 2013 +0100

    re-enable intuit-only matches
    
    The COW changes inadvertently disabled intuit-only matches.
    These are where calling intuit_start() to find the starting point for a
    match is enough to know that the whole pattern will match, and so you can
    skip calling regexec() too. For example, fixed strings without captures
    such as /abc/.
    
    The COW breakage meant that regexec was always called, making something
    like /abc/ about 3 times slower.
    
    This commit re-enables intuit-only matches.
    
    However, it turns out that this opens up a can of worms.
    Normally, recording the just-matched-against string so that things like $&
    and captures work, is done within regexec(). When this is skipped,
    pp_match has to do a similar thing itself. The code that does this (which
    is in principle a copy of the code in regexec()) is a bit of a mess. Due
    to a logic error, a big chunk of it has actually been dead code for 10+
    years.  It's had lots of modifications (e.g. people have made the same
    changes to regexec() and pp_match()), but since it never gets executed,
    errors aren't detected. And the bits that are executed haven't completely
    received all the COW and SAWAMPERSAND updates that have happened recently.
    
    The best way to fix this is to extract out the capture code in
    regexec() into a separate function (which we did in the previous commit),
    then throw away all the broken capture code in pp_match() and replace it
    with a call to the new function (which this commit does).
    
    One side effect of this commit is that as well as restoring intuit-only
    behaviour for the patterns that had it pre-COW, it also enables this
    behaviour for patterns which formerly didn't, namely where $& or //p are
    seen.
    
    This commit is the barest minimum necessary to fix this; subsequent
    commits will clean and improve this.

M       pp_hot.c

commit 48fc04c67b865361579fc23a5f8e81ba871cd4c5
Author: David Mitchell <[email protected]>
Date:   Sat Jun 15 17:54:10 2013 +0100

    add Perl_reg_set_capture_string() function
    
    Cut and paste into a separate function, the block of code in
    regexec_flags() that is responsible (on successful match) for setting
    RX_SAVED_COPY, RX_SUBBEG etc, ready for use by capture vars like $1, $&.
    
    Although this function is currently only called from one place, we will
    shortly use it elsewhere too.
    
    This should contain no functional changes.
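
    As a rough picture of what such a helper is for, here is a simplified
    C sketch. The struct and function names (toy_match,
    toy_set_capture_string, toy_whole_match) are hypothetical, the copy is
    always a plain malloc'd duplicate, and none of Perl's copy-on-write
    handling is modelled. It only shows why the matched-against string is
    recorded: so that capture variables still work after the original
    buffer changes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, much-simplified match record; not Perl's regexp struct. */
typedef struct {
    char  *saved;       /* private copy of the matched-against string */
    size_t saved_len;
    size_t start, end;  /* offsets of the whole match within `saved` */
} toy_match;

/* Record the subject string so later capture lookups do not depend on the
 * caller's buffer. Returns 0 on success, -1 on allocation failure. */
static int toy_set_capture_string(toy_match *m, const char *subject,
                                  size_t len, size_t start, size_t end)
{
    assert(m && subject && start <= end && end <= len);
    char *copy = malloc(len + 1);
    if (!copy)
        return -1;
    memcpy(copy, subject, len);
    copy[len] = '\0';
    free(m->saved);     /* drop any copy left over from an earlier match */
    m->saved = copy;
    m->saved_len = len;
    m->start = start;
    m->end = end;
    return 0;
}

/* Copy the text of the whole match (the analogue of $&) into `out`. */
static size_t toy_whole_match(const toy_match *m, char *out, size_t outsz)
{
    size_t n = m->end - m->start;
    assert(m->saved && n < outsz);
    memcpy(out, m->saved + m->start, n);
    out[n] = '\0';
    return n;
}
```

    Because the helper owns the copy, any code path that decides a match
    has succeeded can call it, which is the property the commit message
    relies on for later reuse.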

M       embed.fnc
M       embed.h
M       proto.h
M       regexec.c
-----------------------------------------------------------------------

--
Perl5 Master Repository
