In perl.git, the branch smoke-me/davem-regex-buffer-copy has been created

<http://perl5.git.perl.org/perl.git/commitdiff/2bcd9b62481f3d1cdcd60ef3e7b263485761ddb3?hp=0000000000000000000000000000000000000000>

        at  2bcd9b62481f3d1cdcd60ef3e7b263485761ddb3 (commit)

- Log -----------------------------------------------------------------
commit 2bcd9b62481f3d1cdcd60ef3e7b263485761ddb3
Author: David Mitchell <[email protected]>
Date:   Fri Sep 7 13:32:11 2012 +0100

    fix a bug in handling $+[0] and unicode
    
    The code to decide what substring of a pattern target to copy for the
    sake of $1, $& etc, would, in the absence of $&, only copy the minimum
    range needed to cover $1,$2,...., which might be a shorter range than
    what $& covers. This is fine most of the time, but when calculating
    $+[0] on a Unicode string, it needs a copy of the whole part of the string
    covered by $&, since it needs to convert the byte offset into a char
    offset.
    So to fix this, always copy at least the $& range.
    I suppose we could be more clever about this: detect the presence
    of @+ in the code, only do it for UTF8 etc; but this is simple
    and non-fragile.
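
    As a rough illustration (not the actual regexec.c or mg.c code), the
    byte-to-character conversion behind $+[0] can be sketched in C as
    below. The function names are hypothetical; suboffset and subcoffset
    are the fields described in the "Don't copy all of the match string
    buffer" commit further down this log. The sketch assumes well-formed
    UTF-8.

    ```c
    #include <stddef.h>

    /* Count the characters in the first nbytes bytes of a well-formed
     * UTF-8 buffer. Continuation bytes look like 10xxxxxx; every other
     * byte starts a new character. (Hypothetical helper.) */
    size_t utf8_chars_in_prefix(const unsigned char *buf, size_t nbytes)
    {
        size_t chars = 0;
        for (size_t i = 0; i < nbytes; i++) {
            if ((buf[i] & 0xC0) != 0x80)
                chars++;
        }
        return chars;
    }

    /* Character offset of the end of the match, given:
     *   subbeg     : start of the copied buffer
     *   suboffset  : bytes skipped before the start of the copy
     *   subcoffset : characters skipped before the start of the copy
     *   end_byte   : byte offset of the match end in the original string
     * The copy must hold bytes [suboffset, end_byte) of the original,
     * otherwise the loop reads bytes that were never copied. */
    size_t match_end_char_offset(const unsigned char *subbeg,
                                 size_t suboffset, size_t subcoffset,
                                 size_t end_byte)
    {
        return subcoffset + utf8_chars_in_prefix(subbeg, end_byte - suboffset);
    }
    ```

    The second function only works if the copied buffer extends to the
    end of $&, which is why copying just the $1,$2,... range is not
    enough here.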

M       regexec.c
M       t/re/re_tests

commit 500949cafaf334272b21a89bf044bc3bcf0cd4b3
Author: David Mitchell <[email protected]>
Date:   Sat Sep 1 11:43:53 2012 +0100

    m// and s///; don't copy TEMP/AMAGIC strings
    
    Currently pp_match and pp_subst make a copy of the match string if it's
    SvTEMP(), and in the case of pp_match, also if it's SvAMAGIC().
    
    This is no longer necessary, as the code will always copy the string
    anyway if it's actually needed after the match, i.e. if it detects the
    presence of $1, $& or //p etc. Until a few commits ago, this wasn't the
    case for pp_match: it would sometimes skip copying even in the presence of
    $1 et al for efficiency reasons. Now that that's fixed, we can remove the
    SvTEMP() and SvAMAGIC() tests.
    
    As to why pp_subst did the SvTEMP test, I don't know: but removing it
    didn't make any tests fail!

M       pp_hot.c

commit ed17c6e0c4437f531375c687e04038aac0a3fde5
Author: David Mitchell <[email protected]>
Date:   Sat Sep 1 11:23:58 2012 +0100

    tidy up pattern match copying code
    
    (no functional changes).
    
    1. Remove some dead code from pp_split; an assert already guarantees
    that it can never be reached.
    
    2. Simplify the flags settings for the call to CALLREGEXEC() in
    pp_substcont: on subsequent matches we always set REXEC_NOT_FIRST,
    which forces the regex engine not to copy anyway, so passing the
    REXEC_COPY_STR is pointless, as is the conditional code to set it.
    
    3. (whitespace change): split a conditional expression over 2 lines
    for easier reading.

M       pp.c
M       pp_ctl.c
M       pp_hot.c

commit 17982ef0918a9bdd16fa1cba80777dfe7826eb89
Author: David Mitchell <[email protected]>
Date:   Fri Aug 24 16:17:47 2012 +0100

    stop $foo =~ /(bar)/g skipping copy
    
    Normally in the presence of captures, a successful regex execution
    makes a copy of the matched string, so that $1 et al give the right
    value even if the original string is changed; i.e.
    
        $foo =~ /(123)/g;
        $foo = "bar";
        is("$1", "123");
    
    Until now that test would fail, because perl used to skip the copy for
    the scalar /(...)/g case (but not the C<$&; //g> case). This was to
    avoid a huge slowdown in code like the following:
    
        $x = 'x' x 1_000_000;
        1 while $x =~ /(.)/g;
    
    which would otherwise end up copying a 1MB string a million times.
    
    Now that (with the last commit but one) we copy only the required
    substring of the original string (a 1-byte substring in the above
    example), we can remove this fast-but-incorrect hack.

M       pp_hot.c
M       t/re/pat_advanced.t
M       t/re/pat_psycho.t

commit 90858c98ad7988657fadf49da91025067b38d4a9
Author: David Mitchell <[email protected]>
Date:   Fri Aug 24 15:49:21 2012 +0100

    rationalise t/re/pat_psycho.t
    
    Do some cleanup of this file, without changing its functionality.
    
    Once upon a time, the psycho tests were scattered throughout a single
    pat.t file, before being moved into their own file. Now that they're all
    in a single file, make the $PERL_SKIP_PSYCHO_TEST test a single "skip_all"
    test at the beginning of the file, rather than testing it separately in
    each code block.
    
    Also, make some of the test descriptions more useful, and add a bit of
    debugging output.

M       t/re/pat_psycho.t

commit bbe94cee54bc43dc7062e050662b3897c85af61b
Author: David Mitchell <[email protected]>
Date:   Thu Jul 26 16:04:09 2012 +0100

    Don't copy all of the match string buffer
    
    When a pattern matches, and that pattern contains captures (or $`, $&, $'
    or /p are present), a copy is made of the whole original string, so
    that $1 et al continue to hold the correct value even if the original
    string is subsequently modified. This can have severe performance
    penalties; for example, this code causes a 1MB buffer to be allocated,
    copied and freed a million times:
    
        $&;
        $x = 'x' x 1_000_000;
        1 while $x =~ /(.)/g;
    
    This commit changes this so that, where possible, only the needed
    substring of the original string is copied: in the above case, only a
    1-byte buffer is copied each time. Also, it now reuses or reallocs the
    buffer, rather than freeing and mallocing each time.
    
    Now that PL_sawampersand is a 3-bit flag indicating separately whether
    $`, $& and $' have been seen, they each contribute only their own
    individual penalty; which ones have been seen will limit the extent to
    which we can avoid copying the whole buffer.
    
    Note that the above code *without* the $& is not currently slow, but only
    because the copying is artificially disabled to avoid the performance hit.
    The next but one commit will remove that hack, meaning that it will still
    be fast, but will now be correct in the presence of a modified original
    string.
    
    We achieve this by adding suboffset and subcoffset fields to the
    existing subbeg and sublen fields of a regex, to indicate how many bytes
    and characters have been skipped from the logical start of the string till
    the physical start of the buffer. To avoid copying stuff at the end, we
    just reduce sublen. For example, in this:
    
        "abcdefgh" =~ /(c)d/
    
    subbeg points to a malloced buffer containing "c\0"; sublen == 1,
    and suboffset == 2 (as does subcoffset).
    
    while if $& has been seen,
    
    subbeg points to a malloced buffer containing "cd\0"; sublen == 2,
    and suboffset == 2.
    
    If in addition $' has been seen, then
    
    subbeg points to a malloced buffer containing "cdefgh\0"; sublen == 6,
    and suboffset == 2.
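
    The three cases above can be reproduced with a small C sketch. This
    is an illustration of the idea only, not the code in regexec.c: the
    flag names, the struct and the function are all hypothetical, and
    offsets are in bytes.

    ```c
    #include <stddef.h>

    /* Hypothetical flags: which of $`, $& and $' have been seen. */
    enum { SAW_PRE = 1, SAW_MATCH = 2, SAW_POST = 4 };

    /* Hypothetical result type: where the copy starts and how long it is. */
    struct copy_range {
        size_t suboffset;   /* bytes skipped before the copy */
        size_t sublen;      /* bytes copied */
    };

    /* len              : length of the original string
     * m_start, m_end   : byte range covered by $&
     * cap_start,cap_end: smallest range covering all captures
     * seen             : OR of the SAW_* flags above */
    struct copy_range
    min_copy_range(size_t len, size_t m_start, size_t m_end,
                   size_t cap_start, size_t cap_end, unsigned seen)
    {
        size_t start = cap_start;
        size_t end   = cap_end;

        if (seen & SAW_MATCH) {         /* $& widens the range to the match */
            if (m_start < start) start = m_start;
            if (m_end   > end)   end   = m_end;
        }
        if (seen & SAW_PRE)             /* $` needs everything from the start */
            start = 0;
        if (seen & SAW_POST)            /* $' needs everything to the end */
            end = len;

        return (struct copy_range){ start, end - start };
    }
    ```

    For "abcdefgh" =~ /(c)d/ the match covers bytes 2..4 and the capture
    bytes 2..3, so min_copy_range(8, 2, 4, 2, 3, 0) gives suboffset 2 and
    sublen 1, while passing SAW_MATCH | SAW_POST gives suboffset 2 and
    sublen 6.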
    
    The regex engine won't do this by default; there are two new flag bits,
    REXEC_COPY_SKIP_PRE and REXEC_COPY_SKIP_POST, which in conjunction with
    REXEC_COPY_STR, request that the engine skip the start or end of the
    buffer (it will still copy in the presence of the relevant $`, $&, $',
    /p).
    
    Only pp_match has been enhanced to use these extra flags; substitution
    can't easily benefit, since the usual action of s///g is to copy the
    whole string first time round, then perform subsequent matching iterations
    against the copy, without further copying. So you still need to copy most
    of the buffer.

M       dump.c
M       ext/Devel-Peek/t/Peek.t
M       mg.c
M       pod/perlreapi.pod
M       pp.c
M       pp_ctl.c
M       pp_hot.c
M       regcomp.c
M       regexec.c
M       regexp.h
M       t/porting/known_pod_issues.dat
M       t/re/re_tests

commit a8e569b8c2d47e53f6a3260ff9185067ec5fcc9e
Author: David Mitchell <[email protected]>
Date:   Thu Jul 26 15:35:39 2012 +0100

    Separate handling of ${^PREMATCH} from $` etc
    
    Currently the handling of getting the value, length etc of ${^PREMATCH}
    etc is identical to that of $` etc.
    
    Handle them separately, by adding RX_BUFF_IDX_CARET_PREMATCH etc
    constants to the existing RX_BUFF_IDX_PREMATCH set.
    
    This allows retrieval to always return undef if the current
    match didn't use //p. Previously the result depended on things such
    as whether the (non-//p) pattern included captures or not.
    
    The documentation for ${^PREMATCH} etc states that it's only guaranteed to
    return a defined value when the last pattern was //p.
    
    As well as making things more consistent, this is a necessary
    prerequisite for the following commit, which may not always copy the
    whole string during a non-//p match.

M       mg.c
M       regcomp.c
M       regexp.h
M       t/re/reg_pmod.t

commit df07e6993146350d6dd4861c50645669548fc2ea
Author: David Mitchell <[email protected]>
Date:   Fri Jun 22 16:26:08 2012 +0100

    regexec_flags(): simplify length calculation
    
    The code to calculate the length of the string to copy was
    
        PL_regeol - startpos + (stringarg - strbeg);
    
    This is a hangover from the original (perl 3) regexp implementation
    that, under //i, copied and folded the original buffer, so startpos might
    not equal stringarg. These days they are always equal (except under a
    match failure with (*COMMIT), and the code we're interested in is only
    executed on success).
    
    So simplify to just PL_regeol - strbeg.
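
    A small C sketch of why the two expressions agree. The function
    names are hypothetical and the pointers are passed as parameters
    here, whereas in the real code they are locals and an interpreter
    variable.

    ```c
    #include <stddef.h>

    /* Old calculation: length from startpos to the end of the string,
     * plus the distance from the buffer start to stringarg. */
    ptrdiff_t copy_len_old(const char *strbeg, const char *stringarg,
                           const char *startpos, const char *regeol)
    {
        return regeol - startpos + (stringarg - strbeg);
    }

    /* New calculation: when startpos == stringarg the two middle terms
     * cancel, leaving just the length of the whole buffer. */
    ptrdiff_t copy_len_new(const char *strbeg, const char *regeol)
    {
        return regeol - strbeg;
    }
    ```

    With startpos == stringarg, both functions return the same value for
    any position within the buffer.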

M       regexec.c

commit baf273fedf22ce9ef32eca5765e6f42ce53dea51
Author: David Mitchell <[email protected]>
Date:   Fri Jun 22 12:36:03 2012 +0100

    PL_sawampersand: use 3 bit flags rather than bool
    
    Set a separate flag for each of $`, $& and $'.
    It still works fine in boolean context.
    
    This will allow us to have more refined control over what parts
    of a match string to copy (we currently copy the whole string).
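
    A minimal sketch of the idea in C. The names below are hypothetical
    stand-ins (saw_flags plays the role of PL_sawampersand); the point is
    only that a bit-flag variable still behaves as a boolean while also
    recording which variable was seen.

    ```c
    #include <stdbool.h>

    /* Hypothetical names: one bit per special variable seen. */
    enum {
        SAW_PREMATCH  = 1 << 0,   /* $` */
        SAW_MATCH     = 1 << 1,   /* $& */
        SAW_POSTMATCH = 1 << 2    /* $' */
    };

    /* Stand-in for the interpreter-wide PL_sawampersand variable. */
    static unsigned saw_flags = 0;

    /* Record that one of the three variables has been seen. */
    void note_seen(unsigned bit) { saw_flags |= bit; }

    /* Old-style boolean test: true if any of the three has been seen. */
    bool any_seen(void) { return saw_flags != 0; }

    /* Finer-grained test: has this particular variable been seen? */
    bool seen(unsigned bit) { return (saw_flags & bit) != 0; }
    ```

    Existing code that only asks "was any of them seen?" keeps working,
    while new code can test an individual bit to decide how much of the
    string needs copying.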

M       gv.c
M       intrpvar.h
M       perl.c
M       perl.h

commit 15be01387bf616ecc45e30e8731ef8546d71c3fb
Author: David Mitchell <[email protected]>
Date:   Wed Jun 20 14:17:05 2012 +0100

    document args to regexec_flags and API
    
    Document in the API, and clarify in the source code, what the arguments
    to Perl_regexec_flags are.
    
    NB: this info is based on code inspection, not any real knowledge on my
    part.

M       pod/perlreapi.pod
M       regexec.c
-----------------------------------------------------------------------

--
Perl5 Master Repository
