ID:               40871
 User updated by:  ismith at motorola dot com
 Reported By:      ismith at motorola dot com
 Status:           Bogus
 Bug Type:         PCRE related
 Operating System: Windows Server 2003 SP1
 PHP Version:      5.2.1
 New Comment:

Further info:

I emailed the PCRE maintainer, who said that since PCRE doesn't do
the replacement part, PCRE itself isn't discarding the text.  Apparently
when PCRE sees bad UTF-8, it returns an error code (I believe
PCRE_ERROR_BADUTF8).

I think the text is getting lost in php_pcre_replace_impl.  If
pcre_exec returns PCRE_ERROR_NOMATCH, it saves all the unmatched text in
the result; but if pcre_exec returns some other error code, it looks to
me like it discards the result (which matches what I'm seeing).

I don't see how PHP can do much other than what it's doing; without a
match count back from pcre_exec, it can't process the replacements in
any case.
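
Until this is resolved, a caller can at least detect the failure instead
of silently getting an empty result.  The following is a minimal sketch,
assuming a string subject and a PHP build that provides preg_last_error()
(PHP 5.2.0 or later); replace_or_keep() is a hypothetical helper name, not
part of PHP or MediaWiki.

```php
<?php
// Hypothetical wrapper: run preg_replace(), but hand back the subject
// unchanged if the PCRE engine reported an error (for example bad UTF-8).
function replace_or_keep($pattern, $replacement, $subject)
{
    $result = preg_replace($pattern, $replacement, $subject);

    // preg_last_error() reports the status of the most recent PCRE call.
    if ($result === null || preg_last_error() !== PREG_NO_ERROR) {
        return $subject;
    }
    return $result;
}

$badText = "I love BEARS \342\200\077 and LIONS";
$kept = replace_or_keep("/ELEPHANTS/iu", "MICE", $badText);
printf("Text preserved: %s\n", $kept === $badText ? "yes" : "no");
```

Note that this only preserves the text; no replacements are made in a
string that PCRE rejects, so it is a stopgap rather than a fix.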

My feeling is that PCRE should not return an error code in this case,
but work around the bad UTF-8 character, which would be more in keeping
with the Unicode standard.  I'll discuss this further with the PCRE
folks.  OTOH, maybe MediaWiki should do UTF-8 cleanup on the string
before giving it to PHP.
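
For what it's worth, the cleanup could look something like the sketch
below.  It assumes that replacing each stray byte with U+FFFD is
acceptable; scrub_utf8() and scrub_utf8_callback() are hypothetical names,
not existing PHP or MediaWiki functions.  The pattern is matched byte by
byte (no "u" modifier), so PCRE never runs its UTF-8 check on the input.

```php
<?php
// Hypothetical callback: keep a well-formed character, otherwise emit U+FFFD.
function scrub_utf8_callback($m)
{
    // Group 1 is only set when a well-formed UTF-8 character matched.
    return (isset($m[1]) && $m[1] !== '') ? $m[1] : "\xEF\xBF\xBD";
}

// Hypothetical helper: replace every byte that is not part of a
// well-formed UTF-8 sequence with U+FFFD (the replacement character).
function scrub_utf8($text)
{
    // Byte-level alternatives for exactly one well-formed UTF-8 character.
    $valid = '[\x00-\x7F]'
           . '|[\xC2-\xDF][\x80-\xBF]'
           . '|\xE0[\xA0-\xBF][\x80-\xBF]'
           . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
           . '|\xED[\x80-\x9F][\x80-\xBF]'
           . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
           . '|[\xF1-\xF3][\x80-\xBF]{3}'
           . '|\xF4[\x80-\x8F][\x80-\xBF]{2}';

    // Either a valid character (group 1) or a single stray byte (".").
    return preg_replace_callback('/(' . $valid . ')|./s', 'scrub_utf8_callback', $text);
}

$badText = "I love BEARS \342\200\077 and LIONS";
$clean   = scrub_utf8($badText);

// The scrubbed text now passes PCRE's UTF-8 check, so "u" patterns work.
echo preg_replace("/LIONS/iu", "MICE", $clean), "\n";
```

Going one character per callback is slow on large pages, but it avoids
long repeated groups in the pattern.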


Previous Comments:
------------------------------------------------------------------------

[2007-03-20 20:16:57] [EMAIL PROTECTED]

>Where do I report this?  How do I get it fixed?

See http://pcre.org; I don't know further details myself.

------------------------------------------------------------------------

[2007-03-20 20:03:58] ismith at motorola dot com

Tony, thanks for the response... but more info would be good.  Where do
I report this?  How do I get it fixed?

------------------------------------------------------------------------

[2007-03-20 20:00:17] ismith at motorola dot com

BTW, this bug surfaced in MediaWiki 1.9.3 on a private wiki, where it
causes some pages with pasted-in Windows quotes to be displayed as
blank.

------------------------------------------------------------------------

[2007-03-20 19:58:25] [EMAIL PROTECTED]

This is what the underlying PCRE library returns.

------------------------------------------------------------------------

[2007-03-20 19:54:33] ismith at motorola dot com

Description:
------------
I am using preg_replace to do a search and replace on some text which
contains an invalid UTF-8 code sequence.  I am using the "u" modifier.
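
To show what "invalid" means here, the sketch below inspects the two
three-byte sequences used in the reproduce code.  The lead byte 0xE2
announces a three-byte character, so the next two bytes must both be
continuation bytes of the form 10xxxxxx; is_continuation_byte() is a
hypothetical helper written only for this illustration.

```php
<?php
// Hypothetical helper: true if $byte (0..255) has the bit pattern 10xxxxxx,
// i.e. it is a UTF-8 continuation byte.
function is_continuation_byte($byte)
{
    return ($byte & 0xC0) === 0x80;
}

$samples = array(
    'good' => "\342\200\234",  // E2 80 9C
    'bad'  => "\342\200\077",  // E2 80 3F
);

foreach ($samples as $label => $seq) {
    // unpack() numbers the bytes from 1; byte 1 is the lead byte 0xE2.
    foreach (unpack('C*', $seq) as $i => $byte) {
        if ($i === 1) {
            $kind = 'lead byte';
        } else {
            $kind = is_continuation_byte($byte) ? 'continuation' : 'NOT a continuation';
        }
        printf("%s byte %d: 0x%02X %s\n", $label, $i, $byte, $kind);
    }
}
```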

I believe that preg_replace should suppress the bad character, or
replace it with an appropriate error marker; but otherwise return the
text intact (after making the required replacements).

Both preg_replace and preg_replace_callback return an empty string in
this case, even when the search pattern matches nothing in the input.


Reproduce code:
---------------
<?php

// Text with a valid UTF-8 character sequence.
$goodText = "I hate WOMBATS \342\200\234 and COWS";

// Text with an invalid UTF-8 character sequence.
$badText = "I love BEARS \342\200\077 and LIONS";

$good2 = preg_replace("/ELEPHANTS/iu", "MICE", $goodText);
printf("Was \"%s\"; now \"%s\"\n", $goodText, $good2);

$bad2 = preg_replace("/ELEPHANTS/iu", "MICE", $badText);
printf("Was \"%s\"; now \"%s\"\n", $badText, $bad2);

?>


Expected result:
----------------
Was "I hate WOMBATS ΓÇ£ and COWS"; now "I hate WOMBATS ΓÇ£
and COWS"
Was "I love BEARS ΓÇ? and LIONS"; now "I love BEARS ΓÇ? and
LIONS"


Actual result:
--------------
Was "I hate WOMBATS ΓÇ£ and COWS"; now "I hate WOMBATS ΓÇ£
and COWS"
Was "I love BEARS ΓÇ? and LIONS"; now ""



------------------------------------------------------------------------


-- 
Edit this bug report at http://bugs.php.net/?id=40871&edit=1
