ID: 23946 User updated by: stuge-phpbugs at cdy dot org Reported By: stuge-phpbugs at cdy dot org Status: Verified Bug Type: *Regular Expressions Operating System: Linux PHP Version: 4.3.2 New Comment:
Ok. This patch works in that it gives me the expected results. I modified it slightly to care less about bytes in the string and more about the length variables, that way it's at least widechar-prepared. Woo-hoo! :) Currently running make test.. All reg tests PASS, I've also added a test for this bug. Here's the patch against current CVS: ---8<--- diff -ur php4.cvs/ext/pcre/php_pcre.c php4.new/ext/pcre/php_pcre.c --- php4.cvs/ext/pcre/php_pcre.c 2003-06-07 03:51:13.000000000 +0200 +++ php4.new/ext/pcre/php_pcre.c 2003-06-07 03:52:02.000000000 +0200 @@ -797,7 +797,7 @@ *result_len = 0; start_offset = 0; - while (1) { + do { /* Execute the regular expression. */ count = pcre_exec(re, extra, subject, subject_len, start_offset, exoptions|g_notempty, offsets, size_offsets); @@ -933,7 +933,7 @@ /* Advance to the next piece. */ start_offset = offsets[1]; - } + } while(start_offset < subject_len); efree(offsets); diff -ur php4.cvs/ext/standard/reg.c php4.new/ext/standard/reg.c --- php4.cvs/ext/standard/reg.c 2003-06-07 03:49:19.000000000 +0200 +++ php4.new/ext/standard/reg.c 2003-06-07 03:50:43.000000000 +0200 @@ -302,7 +302,7 @@ err = pos = 0; buf[0] = '\0'; - while (!err) { + do { err = regexec(&re, &string[pos], re.re_nsub+1, subs, (pos ? REG_ NOTBOL : 0)); if (err && err != REG_NOMATCH) { @@ -396,7 +396,7 @@ /* stick that last bit of string on our output */ strcat(buf, &string[pos]); } - } + } while(!err && pos < string_len); /* don't want to leak memory .. */ efree(subs); --->8--- And ext/standard/tests/reg/017.phpt: ---8<--- --TEST-- empty expression replaced extra time at end-of-string --POST-- --GET-- --FILE-- <?php echo ereg_replace('.?$',"a","xyz")?> --EXPECT-- xya --->8--- I make no warranties whatsoever that this will not break things, although I don't think it should. :) Previous Comments: ------------------------------------------------------------------------ [2003-06-06 20:34:53] stuge-phpbugs at cdy dot org $ echo xyz|perl -e 'while(<STDIN>) { $_=~ s/z?$/a/;print}' xya $ echo xyz|perl -e 'while(<STDIN>) { $_=~ s/z?$/a/g;print}' xyaa a$ echo xyz|sed -e 's/z\?$/a/g' xya $ echo xyz|sed -e 's/z\?$/a/' xya perl and sed both behave the way I would expect. [ep]reg_replace() don't. :) A comment from php.net/manual/en/pcre.pattern.modifiers.php: /g is default in PHP. You don't need to set it. preg_replace() similarity with perl may not be desired, I only included it because it showed the exact same behaviour while using a different regex code base, indicating the same algorithm. I believe Andrei's comment in 23903 is a bit quick, at least if this is a dupe, which I would say it is. Again, I believe the error to be that the regexp is "ran" one extra time after all of the input string has already been matched and processed inside [ep]reg_replace(). Because of this, any regexp that matches the empty string will cause one extra replacement to occur at the end of the input string. At least inside ereg_replace() the loop should just quit if the entire string has been processed after the first iteration. I'll try my patch and recomment. ------------------------------------------------------------------------ [2003-06-03 12:59:57] [EMAIL PROTECTED] Verified with 4.3.2 (and 4.3.1 and 4.2.3). Similar to #23903, but here I see why you want to use [ep]reg_replace. ------------------------------------------------------------------------ [2003-06-01 22:38:09] stuge-phpbugs at cdy dot org Verified with 4.3.0 and 4.3.2rc2 but haven't tried 4.3.2. NEWS for 4.3.2 do not mention any *reg_replace changes. Reproducible with eregi_replace(). <?php $f="xyz"; $g=ereg_replace("z?$","a",$f); $h=preg_replace("/z?$/","a",$f); print $f; /* xyz */ print " ".$g; /* xyaa, I want xya */ print " ".$h; /* xyaa, I want xya */ ?> At least in CVS php4/ext/standard/reg.c I believe the problem is that regexec() on line 306 will be called a second time after the first iteration has already matched to the end of the string. Matching from the end of the string should only be allowed on the very first iteration. If this analysis is correct, the following patch might work, but I haven't tried it. --- reg.c.org 2003-06-02 05:23:51.000000000 +0200 +++ reg.c 2003-06-02 05:26:06.000000000 +0200 @@ -302,7 +302,7 @@ err = pos = 0; buf[0] = '\0'; - while (!err) { + do { err = regexec(&re, &string[pos], re.re_nsub+1, subs, (pos ? REG_NOTBOL : 0)); if (err && err != REG_NOMATCH) { @@ -396,7 +396,7 @@ /* stick that last bit of string on our output */ strcat(buf, &string[pos]); } - } + } while(!err && string[pos]); /* don't want to leak memory .. */ efree(subs); ------------------------------------------------------------------------ -- Edit this bug report at http://bugs.php.net/?id=23946&edit=1