In perl.git, the branch smoke-me/davem/regex-trailing-null has been created
<http://perl5.git.perl.org/perl.git/commitdiff/8f6719e2c3acfbc7536d81e44386a43bd0a24aab?hp=0000000000000000000000000000000000000000>

        at  8f6719e2c3acfbc7536d81e44386a43bd0a24aab (commit)

- Log -----------------------------------------------------------------
commit 8f6719e2c3acfbc7536d81e44386a43bd0a24aab
Author: David Mitchell <[email protected]>
Date:   Fri Sep 21 10:29:04 2012 +0100

    stop regex engine reading beyond end of string

    Historically the regex engine has assumed that any string passed to it
    will have a trailing null char. This isn't normally an issue in perl code,
    since perl strings *are* null terminated; but it could cause problems with
    strings returned by XS code, or with someone calling the regex engine
    directly from XS, with strend not pointing at a null char.

    The engine currently relies on there being a null char in the following
    ways.

    First, when at the end of string, the main loop of regmatch() still reads
    in the 'next' character (i.e. the character following the end of string)
    even if it doesn't make any use of it. This precludes using memory mapped
    files as strings for example, since the read off the end would SEGV.

    Second, the matching algorithm often required the trailing character to be
    \0 to work correctly: the test for 'EOF' was "if next char is null *and*
    locinput >= PL_regeol, then stop". So a random non-null trailing char
    could cause an overshoot.

    Thirdly, some match ops require the trailing char to be null to operate
    correctly; for example, \b applied at the end of the string only happens
    to work because the trailing char (\0) happens to match \W.
    Also, some utf8 ops will try to extract the code point at the end, which
    can result in multiple bytes past the end of string being read, and
    possible problems if they don't correspond to well-formed utf8.

    The main fix is in S_regmatch, where the 'read next char' code has been
    updated to set it to a special value, NEXTCHR_EOS instead, if we would be
    reading past the end of the string.
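The end-of-string sentinel described above can be illustrated with a small standalone sketch. This is not the code from regexec.c: only the name NEXTCHR_EOS comes from the commit message, while its value here and the helpers read_nextchr() and match_run_of_a() are hypothetical, chosen for illustration. The point is that the byte at the end pointer is never dereferenced, and the end of the string is detected by position, not by finding a \0.

```c
#include <assert.h>
#include <stddef.h>

/* Sentinel meaning "there is no next character". The name is taken from the
 * commit message; the value is an assumption (any value outside 0..255). */
#define NEXTCHR_EOS (-1)

/* Hypothetical helper: return the byte at p as 0..255, or NEXTCHR_EOS when
 * p is at or past end. The byte at 'end' itself is never read. */
static int read_nextchr(const char *p, const char *end)
{
    return (p < end) ? (int)(unsigned char)*p : NEXTCHR_EOS;
}

/* Toy matcher (hypothetical): length of the leading run of 'a' bytes in
 * [start, end). The next character is re-read at the top of each iteration. */
static size_t match_run_of_a(const char *start, const char *end)
{
    const char *locinput = start;

    for (;;) {
        int nextchr = read_nextchr(locinput, end);

        if (nextchr == NEXTCHR_EOS || nextchr != 'a')
            break;
        locinput++;
    }
    return (size_t)(locinput - start);
}
```

For example, given a four-byte buffer holding 'a', 'a', 'a', 'a' with no terminator, calling match_run_of_a(buf, buf + 3) returns 3: the fourth byte is never examined, whereas a matcher that stopped only on \0 would overshoot the logical end.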

    Lots of other random bits in the regex engine needed to be fixed up too.

    To track these down, I temporarily hacked regexec_flags() to make a copy
    of the string but without trailing \0, then ran all the t/re/*.t tests
    under valgrind to flush out all buffer overruns. So I think I've removed
    most of the bad code, but by no means all of it. The code within the
    various functions in regexec.c is far too complex to be able to visually
    audit the code with any confidence.

M   MANIFEST
M   ext/XS-APItest/APItest.pm
M   ext/XS-APItest/APItest.xs
A   ext/XS-APItest/t/callregexec.t
M   regexec.c

commit ffb83602ac7621e306ecae2bc5a8b0d224eb3d87
Author: David Mitchell <[email protected]>
Date:   Sun Sep 16 17:39:06 2012 +0100

    regmatch(): fix typo in TRIE commentary text

M   regexec.c

commit 927ce50c99cfffa62ba5ada03562f9da75224a1c
Author: David Mitchell <[email protected]>
Date:   Sun Sep 16 17:33:08 2012 +0100

    regmatch() annotate ops and separate out branches

    Annotate each 'case OP:' in the main switch in regmatch() to show
    what regex pattern this implements. About half the ops had already been
    done. Also add a blank line between each 'case' statement for readability.
    (no code changes)

M   regexec.c

commit b05efd3c9cc1583c4a8b1719b69077edd9c397df
Author: David Mitchell <[email protected]>
Date:   Fri Sep 14 16:19:10 2012 +0100

    regmatch(): do nextchr=*locinput at top of loop

    Currently each branch in the main regmatch() loop is responsible for
    re-initialising nextchar to UCHARAT(locinput) if locinput is modified.
    By adding
        nextchr = UCHARAT(locinput);
    to the head of the loop, we can remove most of the nextchar assignments
    in the individual branches. We lose slightly for the zero-width
    assertions like \b which will re-read the same nextchar, but this will
    make it easier to handle non-null-terminated strings.

M   regexec.c

commit 6855194d74be66127b6d32dd40a26ddcd0785867
Author: David Mitchell <[email protected]>
Date:   Fri Sep 14 15:46:47 2012 +0100

    regmatch(): nextchar should always be positive

    Remove the one bit of code that tests for < 0, and put in a general
    assert.

M   regexec.c

commit 996dc38f68a45f2bd8cf33d4b2f24775fad675ff
Author: David Mitchell <[email protected]>
Date:   Fri Sep 14 12:37:33 2012 +0100

    regmatch(): consolidate locinput++

    There are several places in the code that increment locinput by 1 char
    (which may or may not be 1 byte) then update nextchr. Consolidate these
    into a single code block with the others goto'ing it.

    This actually reduces the code more than it appears, since the CCC_TRY*
    macros expand into several branches, each of which repeat the increment
    code.

M   regexec.c

commit 10cd6a101a65575e939faa0a2e805236aa2adf51
Author: David Mitchell <[email protected]>
Date:   Fri Sep 14 11:28:08 2012 +0100

    regmatch(): use nextchar where available

    In a couple of places the code was using *locinput, where nextchar
    already equalled *locinput

M   regexec.c
-----------------------------------------------------------------------

--
Perl5 Master Repository
