In perl.git, the branch smoke-me/davem/regex-trailing-null has been created

<http://perl5.git.perl.org/perl.git/commitdiff/4eb65639bf11bc6be6bbf96317594975fb019312?hp=0000000000000000000000000000000000000000>

        at  4eb65639bf11bc6be6bbf96317594975fb019312 (commit)

- Log -----------------------------------------------------------------
commit 4eb65639bf11bc6be6bbf96317594975fb019312
Author: David Mitchell <[email protected]>
Date:   Fri Sep 21 10:29:04 2012 +0100

    stop regex engine reading beyond end of string
    
    Historically the regex engine has assumed that any string passed to it
    will have a trailing null char. This isn't normally an issue in perl code,
    since perl strings *are* null terminated; but it could cause problems with
    strings returned by XS code, or with someone calling the regex engine
    directly from XS, with strend not pointing at a null char.
    
    The engine currently relies on there being a null char in the following
    ways.
    
    First, when at the end of string, the main loop of regmatch() still reads
    in the 'next' character (i.e. the character following the end of string)
    even if it doesn't make any use of it. This precludes using memory mapped
    files as strings for example, since the read off the end would SEGV.
    
    Second, the matching algorithm often required the trailing character to be
    \0 to work correctly: the test for 'EOF' was "if next char is null *and*
    locinput >= PL_regeol, then stop". So a random non-null trailing char
    could cause an overshoot.
    
    Third, some match ops require the trailing char to be null to operate
    correctly; for example, \b applied at the end of the string only happens
    to work because the trailing char (\0) happens to match \W.
    
    Also, some utf8 ops will try to extract the code point at the end, which
    can result in multiple bytes past the end of string being read, and
    possible problems if they don't correspond to well-formed utf8.
    
    The main fix is in S_regmatch, where the 'read next char' code has been
    updated to set nextchr to a special value, NEXTCHR_EOS, instead of
    reading past the end of the string.
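    
    A minimal sketch of the sentinel idea, not the code from regexec.c: the
    helper names and the sentinel's value are assumptions made for
    illustration. The point is that the end pointer is never dereferenced,
    and that a negative sentinel cannot collide with any byte value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for NEXTCHR_EOS; the value is an assumption. */
#define SKETCH_NEXTCHR_EOS (-10)

/* Return the byte at pos as a non-negative int, or the sentinel when
 * pos is at or beyond end. The byte at end is never read. */
static int sketch_next_char(const char *pos, const char *end)
{
    int c;

    if (pos >= end)
        return SKETCH_NEXTCHR_EOS;
    c = (unsigned char)*pos;
    assert(c >= 0);            /* a real character is never negative */
    return c;
}

/* Count the leading bytes equal to want in a buffer of len bytes.
 * The loop stops on the sentinel, so no trailing null is needed. */
static size_t sketch_count_leading(const char *s, size_t len, char want)
{
    const char *pos = s;
    const char *end = s + len;

    while (sketch_next_char(pos, end) == (unsigned char)want)
        pos++;
    return (size_t)(pos - s);
}
```

    With a four-byte buffer of 'a' and a length of 3, the count stops at 3
    even though the byte after the end is a non-null 'a', which is the
    overshoot case described above.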
    
    Lots of other random bits in the regex engine needed to be fixed up too.
    
    To track these down, I temporarily hacked regexec_flags() to make a copy
    of the string but without trailing \0, then ran all the t/re/*.t tests
    under valgrind to flush out all buffer overruns. So I think I've removed
    most of the bad code, but by no means all of it. The code within the
    various functions in regexec.c is far too complex to be able to visually
    audit the code with any confidence.
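    
    The temporary hack is not part of this commit. A rough sketch of the
    copying step, with a hypothetical helper name, might look like the
    following: the copy is allocated at exactly the string's length, so a
    memory checker can flag any read past the last byte.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copy len bytes of s into a heap buffer with no trailing null.
 * The caller frees the result. For len == 0 one byte is allocated,
 * because malloc(0) is allowed to return NULL. */
static char *sketch_copy_unterminated(const char *s, size_t len)
{
    char *copy;

    assert(s != NULL || len == 0);
    copy = malloc(len ? len : 1);
    if (copy == NULL)
        return NULL;
    if (len)
        memcpy(copy, s, len);  /* copies the bytes, but not a '\0' */
    return copy;
}
```

    Matching against such a copy under valgrind turns a silent read of the
    old trailing null into a reported invalid read.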

M       MANIFEST
M       ext/XS-APItest/APItest.xs
A       ext/XS-APItest/t/callregexec.t
M       regexec.c

commit 1a5ccdc28ff4ac9d80a900509a8d039a4e1ac8b1
Author: David Mitchell <[email protected]>
Date:   Sun Sep 16 17:39:06 2012 +0100

    regmatch(): fix typo in TRIE commentary text

M       regexec.c

commit 3a4767a4bd0ee0bdf79e77af8e0fac6ff77a4580
Author: David Mitchell <[email protected]>
Date:   Sun Sep 16 17:33:08 2012 +0100

    regmatch() annotate ops and separate out branches
    
    Annotate each 'case OP:' in the main switch in regmatch() to show
    what regex pattern this implements. About half the ops had already been
    done. Also add a blank line between each 'case' statement for readability.
    (no code changes)

M       regexec.c

commit 5a3bc858aee1ab4c87bc852fab8bd61483434323
Author: David Mitchell <[email protected]>
Date:   Fri Sep 14 16:19:10 2012 +0100

    regmatch(): do nextchr=*locinput at top of loop
    
    Currently each branch in the main regmatch() loop is responsible for
    re-initialising nextchr to UCHARAT(locinput) if locinput is modified.
    
    By adding
        nextchr = UCHARAT(locinput);
    to the head of the loop, we can remove most of the nextchr assignments
    in the individual branches. We lose slightly for the zero-width assertions
    like \b which will re-read the same nextchr, but this will make it
    easier to handle non-null-terminated strings.

M       regexec.c

commit e6793fcf07ccef2f99998a85aafa35c1e8c3df4f
Author: David Mitchell <[email protected]>
Date:   Fri Sep 14 15:46:47 2012 +0100

    regmatch(): nextchar should always be positive
    
    Remove the one bit of code that tests for < 0, and put in a
    general assert.

M       regexec.c

commit aa73addbfd79d811ab8840488ce8661cc1edc701
Author: David Mitchell <[email protected]>
Date:   Fri Sep 14 12:37:33 2012 +0100

    regmatch(): consolidate locinput++
    
    There are several places in the code that increment locinput by 1 char
    (which may or may not be 1 byte) then update nextchr.
    
    Consolidate these into a single code block with the others goto'ing it.
    This actually reduces the code more than it appears, since the CCC_TRY*
    macros expand into several branches, each of which repeats the
    increment code.
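    
    As an illustration of the pattern only (the op names and helpers below
    are hypothetical and far simpler than regmatch()), each branch tests its
    condition and then jumps to one shared block that advances by a single
    character, which may be more than one byte in UTF-8:

```c
#include <assert.h>
#include <stddef.h>

enum sketch_op { SK_ANY, SK_DIGIT, SK_END };

/* Byte length of the UTF-8 character with lead byte c
 * (assumes well-formed input). */
static size_t sketch_char_len(unsigned char c)
{
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Return 1 if ops match at the start of [s, end), else 0. */
static int sketch_match(const enum sketch_op *ops, const char *s,
                        const char *end, int utf8)
{
    const char *pos = s;
    size_t n;

    for (;; ops++) {
        switch (*ops) {
        case SK_END:
            return 1;

        case SK_ANY:
            if (pos >= end) return 0;
            goto increment_pos;

        case SK_DIGIT:
            if (pos >= end || *pos < '0' || *pos > '9') return 0;
            goto increment_pos;
        }
        return 0;                       /* unknown op */

      increment_pos:
        /* the single place where the position moves on by one char */
        n = utf8 ? sketch_char_len((unsigned char)*pos) : 1;
        if (n > (size_t)(end - pos)) return 0;  /* truncated char */
        pos += n;
    }
}
```

    Keeping the advance in one block means the byte-versus-character
    distinction and the end-of-string check are written once rather than in
    every branch.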

M       regexec.c

commit d8749b2d9d09747aeb7210d8914687b7eb705d3b
Author: David Mitchell <[email protected]>
Date:   Fri Sep 14 11:28:08 2012 +0100

    regmatch(): use nextchar where available
    
    In a couple of places the code was using *locinput where
    nextchr already equalled *locinput.

M       regexec.c
-----------------------------------------------------------------------

--
Perl5 Master Repository
