Edit report at http://bugs.php.net/bug.php?id=52971&edit=1

 ID:                 52971
 User updated by:    marc dot bennewitz at giata dot de
 Reported by:        marc dot bennewitz at giata dot de
 Summary:            PCRE-Meta-Characters not working with utf-8
 Status:             Bogus
 Type:               Bug
 Package:            PCRE related
 Operating System:   Linux
 PHP Version:        5.3.3
 Block user comment: N

 New Comment:

There are some problems with it:

1. On Windows it works as expected
2. With Unicode properties there is no word boundary (\w \W)
3. With the modifier "u" PHP knows that the subject is UTF-8
4. http://php.net/manual/regexp.reference.escape.php has no note about
   UTF-8 incompatibility

php.exe -i
...
iconv

iconv support => enabled
iconv implementation => "libiconv"
iconv library version => 1.11

Directive => Local Value => Master Value
iconv.input_encoding => ISO-8859-1 => ISO-8859-1
iconv.internal_encoding => ISO-8859-1 => ISO-8859-1
iconv.output_encoding => ISO-8859-1 => ISO-8859-1
...
pcre

PCRE (Perl Compatible Regular Expressions) Support => enabled
PCRE Library Version => 8.02 2010-03-19

Directive => Local Value => Master Value
pcre.backtrack_limit => 100000 => 100000
pcre.recursion_limit => 100000 => 100000
...

Previous Comments:
------------------------------------------------------------------------
[2010-10-02 20:26:05] cataphr...@php.net

This is by design; it's the way \b and \w are defined in PCRE. You'll
have to use another strategy, like look-behind and Unicode character
properties.

------------------------------------------------------------------------
[2010-10-02 17:58:41] marc dot bennewitz at giata dot de

Description:
------------
PCRE meta-characters like \b and \w are not working with Unicode strings.
PHP-5.3.3 (32Bit)

pcre

PCRE (Perl Compatible Regular Expressions) Support => enabled
PCRE Library Version => 8.02 2010-03-19

Directive => Local Value => Master Value
pcre.backtrack_limit => 100000 => 100000
pcre.recursion_limit => 100000 => 100000

iconv

iconv support => enabled
iconv implementation => glibc
iconv library version => 2.10.1

Directive => Local Value => Master Value
iconv.input_encoding => ISO-8859-1 => ISO-8859-1
iconv.internal_encoding => ISO-8859-1 => ISO-8859-1
iconv.output_encoding => ISO-8859-1 => ISO-8859-1

Test script:
---------------
<?php
// encoding: UTF-8
$message = 'Der ist ein Süßwasserpool Süsswasserpool ... verschiedene Wassersportmöglichkeiten bei ...';

$pattern = '/\bwasser/iu';
preg_match_all($pattern, $message, $match, PREG_OFFSET_CAPTURE);
var_dump($match);

$pattern = '/[^\w]wasser/iu';
preg_match_all($pattern, $message, $match, PREG_OFFSET_CAPTURE);
var_dump($match);

Expected result:
----------------
array(1) {
  [0]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(6) "Wasser"
      [1]=>
      int(61)
    }
  }
}
array(1) {
  [0]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(7) " Wasser"
      [1]=>
      int(60)
    }
  }
}

Actual result:
--------------
array(1) {
  [0]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      string(6) "wasser"
      [1]=>
      int(17)
    }
    [1]=>
    array(2) {
      [0]=>
      string(6) "Wasser"
      [1]=>
      int(61)
    }
  }
}
array(1) {
  [0]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      string(8) "ßwasser"
      [1]=>
      int(15)
    }
    [1]=>
    array(2) {
      [0]=>
      string(7) " Wasser"
      [1]=>
      int(60)
    }
  }
}

------------------------------------------------------------------------
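
The reply of 2010-10-02 20:26:05 suggests look-behind combined with Unicode character properties as a workaround. A minimal sketch of that idea follows. The pattern is an assumption written for illustration and does not come from the report. It also assumes a PCRE build with Unicode property support, and it treats letters, digits and the underscore as the "word" characters that must not precede the match.

```php
<?php
// encoding: UTF-8
// Sketch only: this pattern is an illustrative assumption, not from the report.
$message = 'Der ist ein Süßwasserpool Süsswasserpool ... verschiedene Wassersportmöglichkeiten bei ...';

// (?<![\p{L}\p{N}_]) is a negative look-behind: "wasser" must not be
// preceded by a Unicode letter (\p{L}), a Unicode digit (\p{N}) or "_".
// Unlike \b without Unicode property support, this treats "ß" as a letter.
$pattern = '/(?<![\p{L}\p{N}_])wasser/iu';

preg_match_all($pattern, $message, $match, PREG_OFFSET_CAPTURE);
var_dump($match);
```

With the message from the test script this should leave a single match, "Wasser" at byte offset 61, which is the expected result for the first pattern. Offsets reported by PREG_OFFSET_CAPTURE are byte offsets, not character offsets, even with the "u" modifier.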