ID:               44336
Updated by:       [email protected]
Reported By:      frode at coretrek dot com
Status:           Closed
Bug Type:         PCRE related
Operating System: Debian GNU/Linux 4.0r3
PHP Version:      5.2.6RC1
Assigned To:      nlopess
New Comment:

Fixed in PHP_5_2 now as well.


Previous Comments:
------------------------------------------------------------------------

[2008-03-08 12:04:31] [email protected]

This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.

Thank you for the report, and for helping us make PHP better.

this went to the PHP 5.3 and 6 branches only and won't be ported to PHP
5.2.

------------------------------------------------------------------------

[2008-03-07 08:23:54] frode at coretrek dot com

Thanks :) Do you have any idea if there's a chance to get this fix into
PHP-5.2.6 before release?

------------------------------------------------------------------------

[2008-03-05 16:34:41] [email protected]

Nice work! :) Assigned to maintainer.

------------------------------------------------------------------------

[2008-03-05 15:46:27] frode at coretrek dot com

Here's the patch. If it doesn't come through cleanly, it's also
available at http://apollo.coretrek.com/~frode/phpbug-44336.patch.txt

--- php_pcre.c.orig	2008-03-05 16:37:09.000000000 +0100
+++ php_pcre.c	2008-03-05 16:38:18.000000000 +0100
@@ -599,20 +599,23 @@
 
 	match = NULL;
 	matched = 0;
 	PCRE_G(error_code) = PHP_PCRE_NO_ERROR;
 
 	do {
 		/* Execute the regular expression. */
 		count = pcre_exec(pce->re, extra, subject, subject_len, start_offset,
 				exoptions|g_notempty, offsets, size_offsets);
 
+		/* Prevent lengthy UTF8 check on subsequent pcre_exec() calls to save time (See PHP bug 44336) */
+		exoptions |= PCRE_NO_UTF8_CHECK;
+
 		/* Check for too many substrings condition. */
 		if (count == 0) {
 			php_error_docref(NULL TSRMLS_CC, E_NOTICE, "Matched, but too many substrings");
 			count = size_offsets/3;
 		}
 
 		/* If something has matched */
 		if (count > 0) {
 			matched++;
 			match = subject + offsets[0];
@@ -1002,20 +1005,23 @@
 	match = NULL;
 	*result_len = 0;
 	start_offset = 0;
 	PCRE_G(error_code) = PHP_PCRE_NO_ERROR;
 
 	while (1) {
 		/* Execute the regular expression. */
 		count = pcre_exec(pce->re, extra, subject, subject_len, start_offset,
 				exoptions|g_notempty, offsets, size_offsets);
 
+		/* Prevent lengthy UTF8 check on subsequent pcre_exec() calls to save time (See PHP bug 44336) */
+		exoptions |= PCRE_NO_UTF8_CHECK;
+
 		/* Check for too many substrings condition. */
 		if (count == 0) {
 			php_error_docref(NULL TSRMLS_CC,E_NOTICE, "Matched, but too many substrings");
 			count = size_offsets/3;
 		}
 
 		piece = subject + start_offset;
 
 		if (count > 0 && (limit == -1 || limit > 0)) {
 			if (replace_count) {
@@ -1439,20 +1445,23 @@
 	last_match = subject;
 	match = NULL;
 	PCRE_G(error_code) = PHP_PCRE_NO_ERROR;
 
 	/* Get next piece if no limit or limit not yet reached and something matched*/
 	while ((limit_val == -1 || limit_val > 1)) {
 		count = pcre_exec(pce->re, extra, subject,
 				subject_len, start_offset,
 				exoptions|g_notempty, offsets, size_offsets);
 
+		/* Prevent lengthy UTF8 check on subsequent pcre_exec() calls to save time (See PHP bug 44336) */
+		exoptions |= PCRE_NO_UTF8_CHECK;
+
 		/* Check for too many substrings condition. */
 		if (count == 0) {
 			php_error_docref(NULL TSRMLS_CC,E_NOTICE, "Matched, but too many substrings");
 			count = size_offsets/3;
 		}
 
 		/* If something matched */
 		if (count > 0) {
 			match = subject + offsets[0];

------------------------------------------------------------------------

[2008-03-05 15:44:49] frode at coretrek dot com

According to ext/pcre/pcrelib/doc/pcre.txt and
ext/pcre/pcrelib/ChangeLog there is a flag PCRE_NO_UTF8_CHECK which was
added in libpcre 4.5:

> 3. When matching a UTF-8 string, the test for a valid string at the start has
>    been extended. If start_offset is not zero, PCRE now checks that it points
>    to a byte that is the start of a UTF-8 character. If not, it returns
>    PCRE_ERROR_BADUTF8_OFFSET (-11). Note: the whole string is still checked;
>    this is necessary because there may be backward assertions in the pattern.
>    When matching the same subject several times, it may save resources to use
>    PCRE_NO_UTF8_CHECK on all but the first call if the string is long.

I tried patching ext/pcre/php_pcre.c and adding this PCRE_NO_UTF8_CHECK
to the options passed to pcre_exec() (setting the flag after the first
call to pcre_exec()) and it speeds up execution tremendously.

With the patch I now get:

matches: NO   unicode: NO    0.024386882781982 sec
matches: NO   unicode: YES   0.021436929702759 sec
matches: YES  unicode: NO    0.060844898223877 sec
matches: YES  unicode: YES   0.062279939651489 sec

I'll attach the patch shortly, but someone should review it to make sure
it doesn't open up any security holes or buffer overflows (for example,
would it be possible to create an invalid UTF-8 string by using a
replacement pattern containing an invalid UTF-8 string?)

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
    http://bugs.php.net/44336

-- 
Edit this bug report at http://bugs.php.net/?id=44336&edit=1
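
Illustration of the "validate once, then skip" idea behind the patch.
This is only a sketch and does not use libpcre: find_byte() is a
hypothetical stand-in for pcre_exec(), SKIP_UTF8_CHECK is a hypothetical
stand-in for PCRE_NO_UTF8_CHECK, and the UTF-8 check is a simplified
structural one (it does not reject overlong forms or surrogates). The
validation_passes counter exists only to make the saving visible.

```c
#include <stddef.h>
#include <string.h>

#define SKIP_UTF8_CHECK 0x1  /* hypothetical stand-in for PCRE_NO_UTF8_CHECK */

/* Counts full-subject validation passes, for illustration only. */
static unsigned long validation_passes = 0;

/* Simplified structural UTF-8 check over the whole subject.
 * Returns 1 if every lead byte is followed by the right number of
 * continuation bytes, 0 otherwise. */
static int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    validation_passes++;
    while (i < len) {
        unsigned char c = s[i];
        size_t extra, k;

        if (c < 0x80)                extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else return 0;               /* stray continuation or invalid lead byte */

        if (extra > len - i - 1) return 0;  /* sequence runs past the end */
        for (k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80) return 0;
        i += extra + 1;
    }
    return 1;
}

/* Stand-in for pcre_exec(): validates the whole subject unless told not
 * to, then searches for needle at or after start.
 * Returns the match offset, -1 for no match, -2 for invalid UTF-8. */
static long find_byte(const unsigned char *s, size_t len, size_t start,
                      unsigned char needle, int options)
{
    size_t i;

    if (!(options & SKIP_UTF8_CHECK) && !utf8_is_valid(s, len))
        return -2;
    for (i = start; i < len; i++)
        if (s[i] == needle)
            return (long)i;
    return -1;
}

/* Counts occurrences of needle, setting the skip flag after the first
 * call in the same way the patch does. Returns -1 on invalid UTF-8. */
static int count_matches(const char *subject, unsigned char needle)
{
    const unsigned char *s = (const unsigned char *)subject;
    size_t len = strlen(subject);
    size_t start = 0;
    int options = 0;
    int matched = 0;

    for (;;) {
        long pos = find_byte(s, len, start, needle, options);

        /* The subject was checked on the first call; skip it from now on. */
        options |= SKIP_UTF8_CHECK;

        if (pos == -2) return -1;
        if (pos < 0) break;
        matched++;
        start = (size_t)pos + 1;
    }
    return matched;
}
```

With this sketch, count_matches("a,b,c", ',') makes three find_byte()
calls but only one validation pass. Without the flag, each call would
rescan the whole subject, so a long subject with many matches costs
roughly (subject length) x (number of calls).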
