In perl.git, the branch smoke-me/davem/regex-trailing-null has been created

<http://perl5.git.perl.org/perl.git/commitdiff/4eb65639bf11bc6be6bbf96317594975fb019312?hp=0000000000000000000000000000000000000000>

        at  4eb65639bf11bc6be6bbf96317594975fb019312 (commit)

- Log -----------------------------------------------------------------
commit 4eb65639bf11bc6be6bbf96317594975fb019312
Author: David Mitchell <[email protected]>
Date:   Fri Sep 21 10:29:04 2012 +0100

    stop regex engine reading beyond end of string

    Historically the regex engine has assumed that any string passed to it
    will have a trailing null char. This isn't normally an issue in perl code,
    since perl strings *are* null terminated; but it could cause problems with
    strings returned by XS code, or with someone calling the regex engine
    directly from XS, with strend not pointing at a null char.

    The engine currently relies on there being a null char in the following
    ways.

    First, when at the end of string, the main loop of regmatch() still reads
    in the 'next' character (i.e. the character following the end of string)
    even if it doesn't make any use of it. This precludes using memory mapped
    files as strings for example, since the read off the end would SEGV.

    Second, the matching algorithm often required the trailing character to be
    \0 to work correctly: the test for 'EOF' was "if next char is null *and*
    locinput >= PL_regeol, then stop". So a random non-null trailing char
    could cause an overshoot.

    Thirdly, some match ops require the trailing char to be null to operate
    correctly; for example, \b applied at the end of the string only happens
    to work because the trailing char (\0) happens to match \W.
    Also, some utf8 ops will try to extract the code point at the end, which
    can result in multiple bytes past the end of string being read, and
    possible problems if they don't correspond to well-formed utf8.

    The main fix is in S_regmatch, where the 'read next char' code has been
    updated to set it to a special value, NEXTCHR_EOS instead, if we would be
    reading past the end of the string.
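    [Editorial illustration, not part of the commit.] The 'read next char'
    change described above can be sketched roughly as follows. This is a toy,
    not the code in regexec.c: the names sketch_next_char, sketch_count_run
    and SKETCH_NEXTCHR_EOS are hypothetical, and the sentinel's value is an
    assumption. The point is only that the end-of-string test uses the strend
    pointer alone, so the byte at strend is never read and need not be \0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the NEXTCHR_EOS sentinel named in the commit.
 * The value is an assumption; any negative int works for this sketch,
 * because real characters are read as unsigned and so are never negative. */
#define SKETCH_NEXTCHR_EOS (-10)

/* Return the character at locinput, or the sentinel when locinput is at or
 * beyond strend. The byte at strend itself is never dereferenced. */
static int sketch_next_char(const char *locinput, const char *strend)
{
    if (locinput >= strend)
        return SKETCH_NEXTCHR_EOS;
    return (int)(unsigned char)*locinput;
}

/* Toy "matcher": count how many leading bytes of [s, strend) equal c.
 * nextchr is refreshed once at the top of every loop iteration, and the
 * loop stops on the sentinel rather than on a trailing null char. */
static size_t sketch_count_run(const char *s, const char *strend, char c)
{
    const char *locinput = s;

    for (;;) {
        int nextchr = sketch_next_char(locinput, strend);

        if (nextchr == SKETCH_NEXTCHR_EOS)
            break;                      /* end of string: stop here */
        assert(nextchr >= 0);           /* real chars are non-negative */
        if (nextchr != (unsigned char)c)
            break;                      /* mismatch */
        locinput++;                     /* advance one byte */
    }
    return (size_t)(locinput - s);
}
```

    With a four-byte buffer of 'a' and no trailing \0, calling
    sketch_count_run(buf, buf + 3, 'a') stops after three bytes even though
    the byte at strend is a matching, non-null 'a'. That is the overshoot
    case the old "null *and* locinput >= PL_regeol" test could not handle.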

    Lots of other random bits in the regex engine needed to be fixed up too.

    To track these down, I temporarily hacked regexec_flags() to make a copy
    of the string but without trailing \0, then ran all the t/re/*.t tests
    under valgrind to flush out all buffer overruns. So I think I've removed
    most of the bad code, but by no means all of it. The code within the
    various functions in regexec.c is far too complex to be able to visually
    audit the code with any confidence.

M	MANIFEST
M	ext/XS-APItest/APItest.xs
A	ext/XS-APItest/t/callregexec.t
M	regexec.c

commit 1a5ccdc28ff4ac9d80a900509a8d039a4e1ac8b1
Author: David Mitchell <[email protected]>
Date:   Sun Sep 16 17:39:06 2012 +0100

    regmatch(): fix typo in TRIE commentary text

M	regexec.c

commit 3a4767a4bd0ee0bdf79e77af8e0fac6ff77a4580
Author: David Mitchell <[email protected]>
Date:   Sun Sep 16 17:33:08 2012 +0100

    regmatch() annotate ops and separate out branches

    Annotate each 'case OP:' in the main switch in regmatch() to show
    what regex pattern this implements. About half the ops had already been
    done.

    Also add a blank line between each 'case' statement for readability.

    (no code changes)

M	regexec.c

commit 5a3bc858aee1ab4c87bc852fab8bd61483434323
Author: David Mitchell <[email protected]>
Date:   Fri Sep 14 16:19:10 2012 +0100

    regmatch(): do nextchr=*locinput at top of loop

    Currently each branch in the main regmatch() loop is responsible for
    re-initialising nextchr to UCHARAT(locinput) if locinput is modified.

    By adding
        nextchr = UCHARAT(locinput);
    to the head of the loop, we can remove most of the nextchr assignments in
    the individual branches.

    We lose slightly for the zero-width assertions like \b which will re-read
    the same nextchr, but this will make it easier to handle
    non-null-terminated strings.

M	regexec.c

commit e6793fcf07ccef2f99998a85aafa35c1e8c3df4f
Author: David Mitchell <[email protected]>
Date:   Fri Sep 14 15:46:47 2012 +0100

    regmatch(): nextchar should always be positive

    Remove the one bit of code that tests for < 0, and put in a general
    assert.

M	regexec.c

commit aa73addbfd79d811ab8840488ce8661cc1edc701
Author: David Mitchell <[email protected]>
Date:   Fri Sep 14 12:37:33 2012 +0100

    regmatch(): consolidate locinput++

    There are several places in the code that increment locinput by 1 char
    (which may or may not be 1 byte) then update nextchr. Consolidate these
    into a single code block with the others goto'ing it.

    This actually reduces the code more than it appears, since the CCC_TRY*
    macros expand into several branches, each of which repeat the increment
    code.

M	regexec.c

commit d8749b2d9d09747aeb7210d8914687b7eb705d3b
Author: David Mitchell <[email protected]>
Date:   Fri Sep 14 11:28:08 2012 +0100

    regmatch(): use nextchar where available

    In a couple of places the code was using *locinput, where nextchr
    already equalled *locinput.

M	regexec.c

-----------------------------------------------------------------------

--
Perl5 Master Repository
