UTF-8) and many hits = very bad performance

andrei Tue, 13 Jan 2009 11:24:37 -0800

 ID:               44336
 Updated by:       [email protected]
 Reported By:      frode at coretrek dot com
 Status:           Closed
 Bug Type:         PCRE related
 Operating System: Debian GNU/Linux 4.0r3
 PHP Version:      5.2.6RC1
 Assigned To:      nlopess
 New Comment:


Fixed in PHP_5_2 now as well.


Previous Comments:
------------------------------------------------------------------------

[2008-03-08 12:04:31] [email protected]

This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.

This went to the PHP 5.3 and 6 branches only and won't be ported to PHP
5.2.

------------------------------------------------------------------------

[2008-03-07 08:23:54] frode at coretrek dot com

Thanks :)

Do you have any idea if there's a chance to get this fix into PHP-5.2.6
before release?

------------------------------------------------------------------------

[2008-03-05 16:34:41] [email protected]

Nice work! :) 

Assigned to maintainer.

------------------------------------------------------------------------

[2008-03-05 15:46:27] frode at coretrek dot com

Here's the patch. If it doesn't come through cleanly, it's also
available at http://apollo.coretrek.com/~frode/phpbug-44336.patch.txt

--- php_pcre.c.orig     2008-03-05 16:37:09.000000000 +0100
+++ php_pcre.c  2008-03-05 16:38:18.000000000 +0100
@@ -599,20 +599,23 @@
 
        match = NULL;
        matched = 0;
        PCRE_G(error_code) = PHP_PCRE_NO_ERROR;
        
        do {
                /* Execute the regular expression. */
                count = pcre_exec(pce->re, extra, subject, subject_len, start_offset,
                                                  exoptions|g_notempty, offsets, size_offsets);
 
+               /* Prevent lengthy UTF8 check on subsequent pcre_exec() calls to save time (See PHP bug 44336) */
+               exoptions |= PCRE_NO_UTF8_CHECK;
+               
                /* Check for too many substrings condition. */  
                if (count == 0) {
                        php_error_docref(NULL TSRMLS_CC, E_NOTICE, "Matched, but too many substrings");
                        count = size_offsets/3;
                }
 
                /* If something has matched */
                if (count > 0) {
                        matched++;
                        match = subject + offsets[0];
@@ -1002,20 +1005,23 @@
        match = NULL;
        *result_len = 0;
        start_offset = 0;
        PCRE_G(error_code) = PHP_PCRE_NO_ERROR;
        
        while (1) {
                /* Execute the regular expression. */
                count = pcre_exec(pce->re, extra, subject, subject_len, start_offset,
                                                  exoptions|g_notempty, offsets, size_offsets);
                
+               /* Prevent lengthy UTF8 check on subsequent pcre_exec() calls to save time (See PHP bug 44336) */
+               exoptions |= PCRE_NO_UTF8_CHECK;
+               
                /* Check for too many substrings condition. */
                if (count == 0) {
                        php_error_docref(NULL TSRMLS_CC,E_NOTICE, "Matched, but too many substrings");
                        count = size_offsets/3;
                }
 
                piece = subject + start_offset;
 
                if (count > 0 && (limit == -1 || limit > 0)) {
                        if (replace_count) {
@@ -1439,20 +1445,23 @@
        last_match = subject;
        match = NULL;
        PCRE_G(error_code) = PHP_PCRE_NO_ERROR;
        
        /* Get next piece if no limit or limit not yet reached and something matched*/
        while ((limit_val == -1 || limit_val > 1)) {
                count = pcre_exec(pce->re, extra, subject,
                                                  subject_len, start_offset,
                                                  exoptions|g_notempty, offsets, size_offsets);
 
+               /* Prevent lengthy UTF8 check on subsequent pcre_exec() calls to save time (See PHP bug 44336) */
+               exoptions |= PCRE_NO_UTF8_CHECK;
+               
                /* Check for too many substrings condition. */
                if (count == 0) {
                        php_error_docref(NULL TSRMLS_CC,E_NOTICE, "Matched, but too many substrings");
                        count = size_offsets/3;
                }
                                
                /* If something matched */
                if (count > 0) {
                        match = subject + offsets[0];

------------------------------------------------------------------------

[2008-03-05 15:44:49] frode at coretrek dot com

According to ext/pcre/pcrelib/doc/pcre.txt and
ext/pcre/pcrelib/ChangeLog there is a flag PCRE_NO_UTF8_CHECK which was
added in libpcre 4.5:

> 3. When matching a UTF-8 string, the test for a valid string at the start has
>    been extended. If start_offset is not zero, PCRE now checks that it points
>    to a byte that is the start of a UTF-8 character. If not, it returns
>    PCRE_ERROR_BADUTF8_OFFSET (-11). Note: the whole string is still checked;
>    this is necessary because there may be backward assertions in the pattern.
>    When matching the same subject several times, it may save resources to use
>    PCRE_NO_UTF8_CHECK on all but the first call if the string is long.

I tried patching ext/pcre/php_pcre.c and adding this PCRE_NO_UTF8_CHECK
to the options passed to pcre_exec() (setting the flag after the first
call to pcre_exec()) and it speeds up execution tremendously.
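To make the cost concrete, here is a minimal, self-contained sketch of the
idea. It does not use libpcre: toy_exec(), toy_utf8_valid(),
TOY_NO_UTF8_CHECK and bytes_validated are hypothetical stand-ins invented for
this illustration, and the validator only checks lead/continuation byte
structure. It assumes the matcher re-validates the whole subject on every
call unless told not to, which is what the ChangeLog entry above describes.

```c
#include <stddef.h>

/* Hypothetical flag, standing in for PCRE_NO_UTF8_CHECK. */
#define TOY_NO_UTF8_CHECK 0x1

/* Counts bytes examined by the validator so the cost is visible. */
unsigned long bytes_validated = 0;

/* Simplified structural UTF-8 check (assumption: does not reject
 * overlong forms or surrogates). Returns 1 if valid, 0 otherwise. */
int toy_utf8_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t extra, k;
        if (c < 0x80)                extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else return 0;
        if (i + extra >= len) return 0;          /* truncated sequence */
        for (k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80) return 0;
        bytes_validated += extra + 1;
        i += extra + 1;
    }
    return 1;
}

/* Toy matcher: finds the next `needle` byte at or after `start`.
 * Validates the WHOLE subject first unless TOY_NO_UTF8_CHECK is set.
 * Returns the match offset, -1 for no match, -2 for invalid UTF-8. */
long toy_exec(const char *subject, size_t len, size_t start,
              char needle, int options)
{
    size_t i;
    if (!(options & TOY_NO_UTF8_CHECK) &&
        !toy_utf8_valid((const unsigned char *)subject, len))
        return -2;
    for (i = start; i < len; i++)
        if (subject[i] == needle) return (long)i;
    return -1;
}

/* Match loop in the shape of the patched loops: the flag is set
 * right after the first call, so later calls skip validation. */
int count_matches(const char *subject, size_t len, char needle,
                  int skip_after_first)
{
    int options = 0, matched = 0;
    size_t start = 0;
    for (;;) {
        long pos = toy_exec(subject, len, start, needle, options);
        if (skip_after_first)
            options |= TOY_NO_UTF8_CHECK;
        if (pos < 0) break;
        matched++;
        start = (size_t)pos + 1;
    }
    return matched;
}
```

For the 7-byte subject "a,b,c,d" and needle ',', both variants find 3
matches, but the unpatched-style loop validates 4 * 7 = 28 bytes (one full
pass per call) while the flag-after-first-call loop validates only 7. With a
long subject and many hits the first variant grows roughly as subject length
times number of matches.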

With the patch I now get:

matches: NO  unicode: NO  0.024386882781982 sec
matches: NO  unicode: YES 0.021436929702759 sec
matches: YES unicode: NO  0.060844898223877 sec
matches: YES unicode: YES 0.062279939651489 sec

I'll attach the patch shortly, but someone should review it to make
sure it doesn't open up any security holes or buffer overflows (for
example, would it be possible to create an invalid UTF-8 string by using
a replacement pattern containing an invalid UTF-8 string?)

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
    http://bugs.php.net/44336

-- 
Edit this bug report at http://bugs.php.net/?id=44336&edit=1

#44336 [Csd]: preg_replace with /u (unicode/UTF-8) and many hits = very bad performance
