Edit report at https://bugs.php.net/bug.php?id=62562&edit=1

 ID:                 62562
 User updated by:    magog dot the dot ogre at gmail dot com
 Reported by:        magog dot the dot ogre at gmail dot com
 Summary:            preg_replace mangles UTF8 string - Windows only
 Status:             Analyzed
 Type:               Bug
 Package:            *Regular Expressions
 Operating System:   Windows x86
 PHP Version:        5.3.14
 Block user comment: N
 Private report:     N

 New Comment:

Just curious: why was this marked as solved?

Previous Comments:
------------------------------------------------------------------------
[2012-07-16 15:38:10] a...@php.net

Btw., the PCRE version reported by PHP is 8.12, but the current one is 8.30. Maybe a simple upgrade could solve this.

------------------------------------------------------------------------
[2012-07-16 15:19:54] a...@php.net

I've tested your PHP snippet on Win7, but it's probably the same on any Windows version. The behaviour is as you describe.

But there is another point. The string to be matched is hardcoded into the script as UTF-8. If you open that file in ASCII mode, you'll see each byte; see here (saved to a file, as the bug tracker ruins the view):

http://belsky.info/phpz/bugz/62562/62562_3.txt

Switch the encoding to UTF-8 in your browser and then to a non-multibyte one. Another way to do that is to open the file under Linux with

vim -c 'set encoding=latin1' 62562_3.txt

In both cases one can see that one byte is interpreted as a space. Combined with the missing UTF-8 modifier, the behaviour is expected; furthermore, Windows seems to do it right :)

I've also debugged this under VS and it's definitely something coming back from PCRE itself. Here

http://lxr.php.net/xref/PHP_5_4/ext/pcre/php_pcre.c#621

count is > 0, so matched is incremented and returned at some point. Nevertheless, it could be a locale thing forcing PCRE to do UTF-8, but I actually don't see any locale-dependent places in PCRE. Trying to boot Linux with the C locale might reproduce this there as well; I have no such machines, though.
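
The byte-level point made in the comment above can be checked with a minimal sketch. The sample string here is an assumption, not the string from the report: it is U+1260 written out as its three UTF-8 bytes, chosen only because its last byte is 0xA0, which Latin-1 reads as a no-break space. The byte-mode result is deliberately not asserted, since it depends on the platform and the character tables in use, which is the open question in this report.

```php
<?php
// Hypothetical sample (NOT the string from the report): U+1260 as UTF-8 bytes.
$sample = "\xE1\x89\xA0";

// Byte-level view: the trailing byte 0xA0 is NO-BREAK SPACE in Latin-1.
$hex = bin2hex($sample);

// Byte mode (no "u" modifier): the subject is matched one byte at a time, so
// the outcome depends on whether the active tables treat 0xA0 as whitespace.
$byteMode = preg_match('/\s/', $sample);

// UTF-8 mode ("u" modifier): the three bytes are decoded as a single code
// point, which is a letter, so \s should not match.
$utf8Mode = preg_match('/\s/u', $sample);

printf("bytes=%s byte-mode=%d utf8-mode=%d\n", $hex, $byteMode, $utf8Mode);
```

If byte-mode prints 1 on one platform and 0 on another while utf8-mode stays at 0, the difference lies in how single bytes above 0x7F are classified, not in the subject string.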
------------------------------------------------------------------------
[2012-07-16 01:39:06] magog dot the dot ogre at gmail dot com

Yeah, it works on SunOS and Ubuntu for me too. Well, if/when you get access to a Windows distro, or another developer who has one comes along, then I guess you can work on this bug. :)

------------------------------------------------------------------------
[2012-07-15 22:43:01] ras...@php.net

Well, I have looked at the code. We take the raw binary string and pass it straight to PCRE on both Windows and UNIX. So something along the way isn't the same. But I am not a Windows guy, so I can't help you on the Windows side of things. It works fine on my Linux box here.

------------------------------------------------------------------------
[2012-07-15 22:32:03] magog dot the dot ogre at gmail dot com

OK then, after doing some more plugging around, it appears that it still might be a PHP issue. Correct me if I'm wrong, but here are my findings:

Create a PHP file with only the following content:

<?php echo preg_match("/\s+/", "ááá¤áá áááªáá")?"1":"0";

Running this on Windows returns "1"; running it on Unix returns "0".

Now I've run the same pattern through PCRE directly, and PCRE reported that there was no match. Thus, it may be a PHP issue. Here is the output:

***Contents of test.txt
/\s+/
ááá¤áá áááªáá
ááá¤áá áááªáá

***Output via Cygwin, running the Windows native pcretest.exe
(redacted)@(redacted)-PC /cygdrive/c/Program Files (x86)/pcre-7.0-bin/bin
$ ./pcretest.exe test.txt
PCRE version 7.0 18-Dec-2006

/\s+/
ááá¤áá áááªáá
No match
ááá¤áá áááªáá
 0:

(I included the second example above with a space purposefully added, just to show that the tool is functioning properly and will catch the space when it's properly there.)

------------------------------------------------------------------------

The remainder of the comments for this report are too long.
To view the rest of the comments, please view the bug report online at

    https://bugs.php.net/bug.php?id=62562