regex-trailing-null, created. v5.17.4-31-g8f6719e

Dave Mitchell Fri, 21 Sep 2012 16:56:36 -0700

In perl.git, the branch smoke-me/davem/regex-trailing-null has been created


<http://perl5.git.perl.org/perl.git/commitdiff/8f6719e2c3acfbc7536d81e44386a43bd0a24aab?hp=0000000000000000000000000000000000000000>

        at  8f6719e2c3acfbc7536d81e44386a43bd0a24aab (commit)

- Log -----------------------------------------------------------------
commit 8f6719e2c3acfbc7536d81e44386a43bd0a24aab
Author: David Mitchell <[email protected]>
Date:   Fri Sep 21 10:29:04 2012 +0100

    stop regex engine reading beyond end of string
    
    Historically the regex engine has assumed that any string passed to it
    will have a trailing null char. This isn't normally an issue in perl code,
    since perl strings *are* null terminated; but it could cause problems with
    strings returned by XS code, or with someone calling the regex engine
    directly from XS, with strend not pointing at a null char.
    
    The engine currently relies on there being a null char in the following
    ways.
    
    First, when at the end of string, the main loop of regmatch() still reads
    in the 'next' character (i.e. the character following the end of string)
    even if it doesn't make any use of it. This precludes using memory mapped
    files as strings for example, since the read off the end would SEGV.
    
    Second, the matching algorithm often required the trailing character to be
    \0 to work correctly: the test for 'EOF' was "if next char is null *and*
    locinput >= PL_regeol, then stop". So a random non-null trailing char
    could cause an overshoot.
    
    Third, some match ops require the trailing char to be null to operate
    correctly; for example, \b applied at the end of the string only happens
    to work because the trailing char (\0) happens to match \W.
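
A minimal sketch of a word-boundary test that consults the end pointer explicitly instead of relying on a trailing \0 happening to be a non-word character. The names is_word_byte and at_word_boundary are hypothetical and do not come from regexec.c, and \w is approximated here as ASCII alphanumerics plus underscore.

```c
#include <assert.h>
#include <ctype.h>

/* Assumption: \w is approximated as ASCII alphanumerics plus '_'. */
static int is_word_byte(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Hypothetical helper: is there a \b-style boundary at pos within
 * [start, end)?  The byte at end is never read, so it does not matter
 * whether it is \0, a word character, or unmapped memory. */
static int at_word_boundary(const char *start, const char *pos,
                            const char *end)
{
    int before, after;

    assert(start <= pos && pos <= end);
    /* Outside the string counts as "not a word character". */
    before = (pos > start) && is_word_byte((unsigned char)pos[-1]);
    after  = (pos < end)   && is_word_byte((unsigned char)*pos);
    return before != after;
}
```

With this shape, a string "abc" stored in a buffer whose next byte is 'd' still reports a boundary at offset 3, because the 'd' lies outside [start, end).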
    
    Also, some utf8 ops will try to extract the code point at the end, which
    can result in multiple bytes past the end of string being read, and
    possible problems if they don't correspond to well-formed utf8.
    
    The main fix is in S_regmatch, where the 'read next char' code has been
    updated to set it to a special value, NEXTCHR_EOS instead, if we would be
    reading past the end of the string.
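
The sentinel idea can be sketched roughly as follows. This is an illustration only, not the code in S_regmatch: the helper names read_nextchr and count_byte are hypothetical, and the numeric value given to NEXTCHR_EOS here is an assumption chosen so that it cannot collide with any byte value 0..255.

```c
#include <assert.h>
#include <stddef.h>

/* Sentinel for "no next character: we are at end of string".
 * The name follows the commit message; the value is an assumption. */
#define NEXTCHR_EOS (-10)

/* Hypothetical helper: return the byte at pos as 0..255, or
 * NEXTCHR_EOS when pos is at or past end.  *end is never read. */
static int read_nextchr(const char *pos, const char *end)
{
    if (pos >= end)
        return NEXTCHR_EOS;
    return (int)(unsigned char)*pos;
}

/* Count occurrences of ch in [start, end).  The loop stops on the
 * sentinel only, so embedded \0 bytes are ordinary data and a
 * non-null byte after the string cannot cause an overshoot. */
static size_t count_byte(const char *start, const char *end,
                         unsigned char ch)
{
    const char *p = start;
    size_t n = 0;
    int nextchr;

    assert(start <= end);
    while ((nextchr = read_nextchr(p, end)) != NEXTCHR_EOS) {
        if (nextchr == ch)
            n++;
        p++;
    }
    return n;
}
```

Because the sentinel is negative and real characters are read as unsigned values, a comparison such as nextchr == 'a' can never succeed by accident at end of string.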
    
    Lots of other random bits in the regex engine needed to be fixed up too.
    
    To track these down, I temporarily hacked regexec_flags() to make a copy
    of the string but without trailing \0, then ran all the t/re/*.t tests
    under valgrind to flush out all buffer overruns. So I think I've removed
    most of the bad code, but by no means all of it. The code within the
    various functions in regexec.c is far too complex to be able to visually
    audit the code with any confidence.
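
The testing trick described above might look something like the sketch below. The helper copy_without_nul is hypothetical and is not the temporary hack that was applied to regexec_flags(); it only shows the general idea of giving a memory checker an allocation that ends exactly where the string ends.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy the len bytes at s into a heap block of
 * exactly len bytes, so that no \0 follows the data.  Any read at
 * copy + len is then an out-of-bounds access that a tool such as
 * valgrind can report.  The caller frees the result. */
static char *copy_without_nul(const char *s, size_t len)
{
    char *copy;

    assert(s != NULL);
    /* malloc(0) may legitimately return NULL, so request at least
     * one byte; for len == 0 that byte is left uninitialised. */
    copy = malloc(len ? len : 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, s, len);
    return copy;
}
```

A matcher under test would then be called with the range copy .. copy + len in place of the original buffer.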

M       MANIFEST
M       ext/XS-APItest/APItest.pm
M       ext/XS-APItest/APItest.xs
A       ext/XS-APItest/t/callregexec.t
M       regexec.c

commit ffb83602ac7621e306ecae2bc5a8b0d224eb3d87
Author: David Mitchell <[email protected]>
Date:   Sun Sep 16 17:39:06 2012 +0100

    regmatch(): fix typo in TRIE commentary text

M       regexec.c

commit 927ce50c99cfffa62ba5ada03562f9da75224a1c
Author: David Mitchell <[email protected]>
Date:   Sun Sep 16 17:33:08 2012 +0100

    regmatch() annotate ops and separate out branches
    
    Annotate each 'case OP:' in the main switch in regmatch() to show
    what regex pattern this implements. About half the ops had already been
    done. Also add a blank line between each 'case' statement for readability.
    (no code changes)

M       regexec.c

commit b05efd3c9cc1583c4a8b1719b69077edd9c397df
Author: David Mitchell <[email protected]>
Date:   Fri Sep 14 16:19:10 2012 +0100

    regmatch(): do nextchr=*locinput at top of loop
    
    Currently each branch in the main regmatch() loop is responsible for
    re-initialising nextchar to UCHARAT(locinput) if locinput is modified.
    
    By adding
        nextchr = UCHARAT(locinput);
    to the head of the loop, we can remove most of the nextchar assignments
    in the individual branches. We lose slightly for the zero-width assertions
    like \b which will re-read the same nextchar, but this will make it
    easier to handle non-null-terminated strings.
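
A toy illustration of the loop shape described here, with the next character reloaded once at the head of the loop rather than in every branch. The op set, struct node and mini_match are hypothetical and far simpler than regmatch(); the sketch also assumes null-terminated input and patterns that contain no \0, as the engine did at this point in the series.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature op set. */
enum op { OP_EXACT, OP_ANY, OP_BOL, OP_END };
struct node { enum op op; unsigned char ch; };

/* Return 1 if prog matches at start, else 0. */
static int mini_match(const struct node *prog, const char *start)
{
    const char *locinput = start;
    int nextchr;

    assert(prog != NULL && start != NULL);
    for (;; prog++) {
        /* Single reload at the head: branches only move locinput. */
        nextchr = (int)(unsigned char)*locinput;

        switch (prog->op) {
        case OP_EXACT:          /* a literal character */
            if (nextchr != prog->ch)
                return 0;
            locinput++;
            break;

        case OP_ANY:            /* any character except end of string */
            if (nextchr == '\0')
                return 0;
            locinput++;
            break;

        case OP_BOL:            /* zero-width: next pass re-reads nextchr */
            if (locinput != start)
                return 0;
            break;

        case OP_END:
            return 1;
        }
    }
}
```

The OP_BOL branch shows the small cost mentioned in the message: it does not move locinput, yet the following iteration still reloads the same character.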

M       regexec.c

commit 6855194d74be66127b6d32dd40a26ddcd0785867
Author: David Mitchell <[email protected]>
Date:   Fri Sep 14 15:46:47 2012 +0100

    regmatch(): nextchar should always be positive
    
    Remove the one bit of code that tests for < 0, and put in a
    general assert.

M       regexec.c

commit 996dc38f68a45f2bd8cf33d4b2f24775fad675ff
Author: David Mitchell <[email protected]>
Date:   Fri Sep 14 12:37:33 2012 +0100

    regmatch(): consolidate locinput++
    
    There are several places in the code that increment locinput by 1 char
    (which may or may not be 1 byte) then update nextchr.
    
    Consolidate these into a single code block with the others goto'ing it.
    This actually reduces the code more than it appears, since the CCC_TRY*
    macros expand into several branches, each of which repeats the
    increment code.
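
The consolidation could be sketched as below: every branch that consumes one character jumps to a single shared block that knows how to advance by one character, whether that is one byte or several. All names here (char_len, enum cls, match_classes) are hypothetical, and char_len is a simplified stand-in for a real UTF-8 length lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in: byte length of the UTF-8 character whose lead
 * byte is c.  Malformed lead bytes are treated as length 1. */
static size_t char_len(unsigned char c)
{
    if (c < 0x80)           return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;
}

enum cls { CLS_ANY, CLS_DIGIT, CLS_END };   /* hypothetical ops */

/* Match a sequence of one-character classes against [s, end).
 * Returns the number of bytes consumed, or -1 on failure. */
static long match_classes(const enum cls *prog, const char *s,
                          const char *end, int utf8)
{
    const char *locinput = s;

    assert(s <= end);
    for (;; prog++) {
        switch (*prog) {
        case CLS_END:
            return (long)(locinput - s);

        case CLS_ANY:
            if (locinput >= end)
                return -1;
            goto increment_locinput;

        case CLS_DIGIT:
            if (locinput >= end
                || (unsigned char)*locinput < '0'
                || (unsigned char)*locinput > '9')
                return -1;
            goto increment_locinput;
        }
        return -1;              /* unknown op */

      increment_locinput:
        /* The only place that advances by one character. */
        if (utf8) {
            size_t n = char_len((unsigned char)*locinput);
            if ((size_t)(end - locinput) < n)
                return -1;      /* character truncated by end */
            locinput += n;
        } else {
            locinput++;
        }
    }
}
```

Keeping the advance in one place means a later change to how a character is stepped over, such as a bounds check against the end of the string, only has to be made once.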

M       regexec.c

commit 10cd6a101a65575e939faa0a2e805236aa2adf51
Author: David Mitchell <[email protected]>
Date:   Fri Sep 14 11:28:08 2012 +0100

    regmatch(): use nextchar where available
    
    In a couple of places the code was using *locinput where
    nextchar already equalled *locinput.

M       regexec.c
-----------------------------------------------------------------------

--
Perl5 Master Repository
