Edit report at https://bugs.php.net/bug.php?id=53823&edit=1
ID: 53823
Comment by: robertbasic dot com at gmail dot com
Reported by: keith at chaos-realm dot net
Summary: preg_replace: * qualifier on unicode replace garbles the string
Status: Verified
Type: Bug
Package: PCRE related
Operating System: Linux
PHP Version: 5.3SVN-2011-01-23 (snap)
Block user comment: N
Private report: N
New Comment:
I tried my best on this one. I tested against trunk:
$ svn info | grep Revision
Revision: 323476
I created a test file for this and will attach it.
I ran the following with gdb:
$ gdb sapi/cgi/php-cgi
and then set a breakpoint
(gdb) break php_pcre.c:1318
and finally ran the test script:
(gdb) run run-tests.php ext/pcre/tests/bug53823.phpt
I copied some of the gdb output to https://gist.github.com/1904467, but it
might be incorrect, as I'm fairly new to all this. Lines 12 and 22 of that
gist caught my attention.
I think the same issue exists for preg_filter, too.
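One way to check the preg_filter suspicion is to run both functions on the same input and compare the raw bytes. This is only a sketch, not part of the attached test file; the input is written as explicit UTF-8 bytes so the encoding of the script file cannot affect the result.

```php
<?php
// 'áéíóú' as explicit UTF-8 bytes.
$input = "\xC3\xA1\xC3\xA9\xC3\xAD\xC3\xB3\xC3\xBA";
$pattern = '/[^\pL\pM]*/iu';

$replaced = preg_replace($pattern, '', $input);
$filtered = preg_filter($pattern, '', $input);

// Print the bytes of each result, then whether the two functions agree.
// The casts keep a NULL error return printable.
var_dump(bin2hex((string) $replaced));
var_dump(bin2hex((string) $filtered));
var_dump($replaced === $filtered);
```

If the two hex dumps match each other but differ from c3a1c3a9c3adc3b3c3ba, both functions garble the string in the same way.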
Previous Comments:
------------------------------------------------------------------------
[2011-01-26 08:02:54] [email protected]
Verified on 5.3 and trunk.
------------------------------------------------------------------------
[2011-01-23 18:10:44] tino dot didriksen at gmail dot com
...and then I forget to change the *. Let's try that again...
These work as expected:
echo preg_replace('/[^\pL\pM]+/iu', '', 'áéíóú');
echo preg_replace('/[^\pL\pM\pN]+/iu', '', 'áéíóú');
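A minimal sketch of the workaround as a self-checking script. Unlike *, the + quantifier cannot match the empty string, so on an input made only of letters nothing matches and the subject should come back unchanged. That empty matches are what trigger the garbling is an assumption here, not something confirmed in this report. The script assumes it is saved as UTF-8.

```php
<?php
// Input from the report; it contains only letters.
$input = 'áéíóú';

// The '+' variants of the two patterns. array() keeps this runnable on 5.3.
$patterns = array('/[^\pL\pM]+/iu', '/[^\pL\pM\pN]+/iu');

foreach ($patterns as $pattern) {
    $result = preg_replace($pattern, '', $input);
    // With nothing to strip, the result should equal the input byte for byte.
    var_dump($result === $input);
}
```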
------------------------------------------------------------------------
[2011-01-23 18:09:23] tino dot didriksen at gmail dot com
A workaround is to use + instead of *.
These work as expected:
echo preg_replace('/[^\pL\pM]*/iu', '', 'áéíóú');
echo preg_replace('/[^\pL\pM\pN]*/iu', '', 'áéíóú');
------------------------------------------------------------------------
[2011-01-23 18:04:49] keith at chaos-realm dot net
.
------------------------------------------------------------------------
[2011-01-23 18:00:57] keith at chaos-realm dot net
Description:
------------
When the following test script is used to strip out everything except Unicode
letters and marks, the string becomes garbled once the * quantifier is added;
the only character that survives intact is ú.
Also, if you add \pN to the exceptions, the ó is preserved as well.
Verified on 5.2, 5.3 and 5.3-SNAP.
Test script:
---------------
echo preg_replace('/[^\pL\pM]*/iu', '', 'áéíóú');
or
echo preg_replace('/[^\pL\pM\pN]*/iu', '', 'áéíóú');
Expected result:
----------------
áéíóú
Actual result:
--------------
����ú
or
���óú (if \pN is added to the exceptions).
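Because the garbled bytes print differently from terminal to terminal, it may help to inspect the result at the byte level. The following is a sketch only: is_valid_utf8() is a hypothetical helper, not part of PHP or of this report, and it relies on preg_match() failing when the /u modifier is given a malformed UTF-8 subject.

```php
<?php
// Hypothetical helper (not from the report): true when $s is valid UTF-8.
// preg_match() with the /u modifier fails on malformed UTF-8 subjects.
function is_valid_utf8($s)
{
    return preg_match('//u', $s) === 1;
}

// 'áéíóú' written as explicit UTF-8 bytes so the file encoding cannot matter.
$input = "\xC3\xA1\xC3\xA9\xC3\xAD\xC3\xB3\xC3\xBA";

foreach (array('/[^\pL\pM]*/iu', '/[^\pL\pM\pN]*/iu') as $pattern) {
    // Cast so a NULL error return is still printable.
    $result = (string) preg_replace($pattern, '', $input);
    printf(
        "%s\n  identical to input: %s\n  valid UTF-8: %s\n  bytes: %s\n",
        $pattern,
        var_export($result === $input, true),
        var_export(is_valid_utf8($result), true),
        bin2hex($result)
    );
}
```

A build that behaves as expected should report the result as identical to the input; on an affected build the hex dump shows exactly which bytes were dropped.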
------------------------------------------------------------------------