ID:               44336
 Updated by:       [EMAIL PROTECTED]
 Reported By:      frode at coretrek dot com
-Status:           Open
+Status:           Assigned
 Bug Type:         PCRE related
 Operating System: Debian GNU/Linux 4.0r3
 PHP Version:      5.2.6RC1
-Assigned To:      
+Assigned To:      nlopess
 New Comment:

Nice work! :) 

Assigned to maintainer.


Previous Comments:
------------------------------------------------------------------------

[2008-03-05 15:46:27] frode at coretrek dot com

Here's the patch. If it doesn't come through cleanly, it's also
available at http://apollo.coretrek.com/~frode/phpbug-44336.patch.txt

--- php_pcre.c.orig     2008-03-05 16:37:09.000000000 +0100
+++ php_pcre.c  2008-03-05 16:38:18.000000000 +0100
@@ -599,20 +599,23 @@
 
        match = NULL;
        matched = 0;
        PCRE_G(error_code) = PHP_PCRE_NO_ERROR;
        
        do {
                /* Execute the regular expression. */
                count = pcre_exec(pce->re, extra, subject, subject_len,
start_offset,
                                                  exoptions|g_notempty, 
offsets, size_offsets);
 
+               /* Prevent lengthy UTF8 check on subsequent pcre_exec() calls to
save time (See PHP bug 44336) */
+               exoptions |= PCRE_NO_UTF8_CHECK;
+               
                /* Check for too many substrings condition. */  
                if (count == 0) {
                        php_error_docref(NULL TSRMLS_CC, E_NOTICE, "Matched, 
but too many
substrings");
                        count = size_offsets/3;
                }
 
                /* If something has matched */
                if (count > 0) {
                        matched++;
                        match = subject + offsets[0];
@@ -1002,20 +1005,23 @@
        match = NULL;
        *result_len = 0;
        start_offset = 0;
        PCRE_G(error_code) = PHP_PCRE_NO_ERROR;
        
        while (1) {
                /* Execute the regular expression. */
                count = pcre_exec(pce->re, extra, subject, subject_len,
start_offset,
                                                  exoptions|g_notempty, 
offsets, size_offsets);
                
+               /* Prevent lengthy UTF8 check on subsequent pcre_exec() calls to
save time (See PHP bug 44336) */
+               exoptions |= PCRE_NO_UTF8_CHECK;
+               
                /* Check for too many substrings condition. */
                if (count == 0) {
                        php_error_docref(NULL TSRMLS_CC,E_NOTICE, "Matched, but 
too many
substrings");
                        count = size_offsets/3;
                }
 
                piece = subject + start_offset;
 
                if (count > 0 && (limit == -1 || limit > 0)) {
                        if (replace_count) {
@@ -1439,20 +1445,23 @@
        last_match = subject;
        match = NULL;
        PCRE_G(error_code) = PHP_PCRE_NO_ERROR;
        
        /* Get next piece if no limit or limit not yet reached and something
matched*/
        while ((limit_val == -1 || limit_val > 1)) {
                count = pcre_exec(pce->re, extra, subject,
                                                  subject_len, start_offset,
                                                  exoptions|g_notempty, 
offsets, size_offsets);
 
+               /* Prevent lengthy UTF8 check on subsequent pcre_exec() calls to
save time (See PHP bug 44336) */
+               exoptions |= PCRE_NO_UTF8_CHECK;
+               
                /* Check for too many substrings condition. */
                if (count == 0) {
                        php_error_docref(NULL TSRMLS_CC,E_NOTICE, "Matched, but 
too many
substrings");
                        count = size_offsets/3;
                }
                                
                /* If something matched */
                if (count > 0) {
                        match = subject + offsets[0];

------------------------------------------------------------------------

[2008-03-05 15:44:49] frode at coretrek dot com

According to ext/pcre/pcrelib/doc/pcre.txt and
ext/pcre/pcrelib/ChangeLog there is a flag PCRE_NO_UTF8_CHECK which was
added in libpcre 4.5:

> 3. When matching a UTF-8 string, the test for a valid string at the
>start has
>    been extended. If start_offset is not zero, PCRE now checks that
>it points
>    to a byte that is the start of a UTF-8 character. If not, it
>returns
>    PCRE_ERROR_BADUTF8_OFFSET (-11). Note: the whole string is still
>checked;
>    this is necessary because there may be backward assertions in the
>pattern.
>    When matching the same subject several times, it may save
>resources to use
>    PCRE_NO_UTF8_CHECK on all but the first call if the string is
>long.

I tried patching ext/pcre/php_pcre.c and adding this PCRE_NO_UTF8_CHECK
to the options passed to pcre_exec() (setting the flag after the first
call to pcre_exec()) and it speeds up execution tremendously.

With the patch I now get:

matches: NO  unicode: NO  0.024386882781982 sec
matches: NO  unicode: YES 0.021436929702759 sec
matches: YES unicode: NO  0.060844898223877 sec
matches: YES unicode: YES 0.062279939651489 sec

I'll attach the patch shortly, but someone should review it to make
sure it doesn't open up any security holes or buffer overflows (for
example, would it be possible to create an invalid UTF-8 string by using
a replacement pattern containing an invalid UTF-8 string?)

------------------------------------------------------------------------

[2008-03-05 14:10:16] frode at coretrek dot com

Description:
------------
The "/u" modifier with preg_replace() yields extraordinarily poor
performance when there are a lot of matches.

I tried to run php through valgrind/callgrind and kcachegrind, and it
seems that the time is mostly spent in "php__pcre_valid_utf8()", Perhaps
this method is (unnecessarily) called over and over again, once for each
substring match/replace?

This happens at least in both PHP 5.2.5 and PHP 5.2.6RC1

Reproduce code:
---------------
<?php
$goodstring = str_repeat('Test',  50000);
$badstring  = str_repeat('Test ', 50000);

$t = microtime(true);
preg_replace('/\\s+/', ' ', $goodstring);
echo "matches: NO  unicode: NO  ".(microtime(true)-$t)." sec\n";

$t = microtime(true);
preg_replace('/\\s+/u', ' ', $goodstring);
echo "matches: NO  unicode: YES ".(microtime(true)-$t)." sec\n";

$t = microtime(true);
preg_replace('/\\s+/', ' ', $badstring);
echo "matches: YES unicode: NO  ".(microtime(true)-$t)." sec\n";

$t = microtime(true);
preg_replace('/\\s+/u', ' ', $badstring);
echo "matches: YES unicode: YES ".(microtime(true)-$t)." sec\n";    


Expected result:
----------------
Similar performance for all runs (less than 1 second)

Actual result:
--------------
matches: NO  unicode: NO  0.020231962204 sec
matches: NO  unicode: YES 0.0206818580627 sec
matches: YES unicode: NO  0.0361981391907 sec
matches: YES unicode: YES 27.6555769444 sec



------------------------------------------------------------------------


-- 
Edit this bug report at http://bugs.php.net/?id=44336&edit=1

Reply via email to