Edit report at http://bugs.php.net/bug.php?id=52971&edit=1

 ID:                 52971
 User updated by:    marc dot bennewitz at giata dot de
 Reported by:        marc dot bennewitz at giata dot de
 Summary:            PCRE-Meta-Characters not working with utf-8
 Status:             Bogus
 Type:               Bug
 Package:            PCRE related
 Operating System:   Linux
 PHP Version:        5.3.3
 Block user comment: N

 New Comment:

There are some problems with this:

1. On Windows it works as expected.

2. Unicode character properties offer no word-boundary equivalent (\w \W).

3. With the "u" modifier, PHP knows that the subject is UTF-8.

4. http://php.net/manual/regexp.reference.escape.php has no note about
UTF-8 incompatibility.



php.exe -i
...
iconv

iconv support => enabled
iconv implementation => "libiconv"
iconv library version => 1.11

Directive => Local Value => Master Value
iconv.input_encoding => ISO-8859-1 => ISO-8859-1
iconv.internal_encoding => ISO-8859-1 => ISO-8859-1
iconv.output_encoding => ISO-8859-1 => ISO-8859-1
...
pcre

PCRE (Perl Compatible Regular Expressions) Support => enabled
PCRE Library Version => 8.02 2010-03-19

Directive => Local Value => Master Value
pcre.backtrack_limit => 100000 => 100000
pcre.recursion_limit => 100000 => 100000
...


Previous Comments:
------------------------------------------------------------------------
[2010-10-02 20:26:05] cataphr...@php.net

This is by design; it is the way \b and \w are defined in PCRE.

You'll have to use another strategy, such as lookbehind and Unicode
character properties.
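
The strategy suggested above can be sketched roughly as follows. This is
only an illustration, not code from the report or from PCRE's
documentation: the character class [\p{L}\p{N}_] is an assumed
Unicode-aware stand-in for \w, the variable names $boundary and $nonWord
are hypothetical, and it assumes the file is saved as UTF-8 and that PCRE
was built with Unicode property support.

```php
<?php // encoding: UTF-8

// Subject string taken from the test script in the original report.
$message = 'Der ist ein Süßwasserpool Süsswasserpool ... verschiedene Wassersportmöglichkeiten bei ...';

// Assumed replacement for \b before "wasser": a negative lookbehind that
// rejects any preceding Unicode letter, Unicode digit, or underscore.
$pattern = '/(?<![\p{L}\p{N}_])wasser/iu';
preg_match_all($pattern, $message, $boundary, PREG_OFFSET_CAPTURE);
var_dump($boundary);

// Assumed replacement for [^\w]: a negated class built from the same
// Unicode properties, so that "ß" counts as a word character.
$pattern = '/[^\p{L}\p{N}_]wasser/iu';
preg_match_all($pattern, $message, $nonWord, PREG_OFFSET_CAPTURE);
var_dump($nonWord);
```

Under those assumptions, the first pattern should match only "Wasser" in
"Wassersportmöglichkeiten", and the second only " Wasser", because "ß" is
classified as a letter by \p{L}. Offsets reported with
PREG_OFFSET_CAPTURE are byte offsets, not character offsets.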

------------------------------------------------------------------------
[2010-10-02 17:58:41] marc dot bennewitz at giata dot de

Description:
------------
PCRE metacharacters such as \b and \w do not work with Unicode strings.



PHP-5.3.3 (32Bit)
pcre

PCRE (Perl Compatible Regular Expressions) Support => enabled
PCRE Library Version => 8.02 2010-03-19

Directive => Local Value => Master Value
pcre.backtrack_limit => 100000 => 100000
pcre.recursion_limit => 100000 => 100000

iconv

iconv support => enabled
iconv implementation => glibc
iconv library version => 2.10.1

Directive => Local Value => Master Value
iconv.input_encoding => ISO-8859-1 => ISO-8859-1
iconv.internal_encoding => ISO-8859-1 => ISO-8859-1
iconv.output_encoding => ISO-8859-1 => ISO-8859-1



Test script:
---------------
<?php // encoding: UTF-8

$message = 'Der ist ein Süßwasserpool Süsswasserpool ... verschiedene Wassersportmöglichkeiten bei ...';

$pattern = '/\bwasser/iu';
preg_match_all($pattern, $message, $match, PREG_OFFSET_CAPTURE);
var_dump($match);

$pattern = '/[^\w]wasser/iu';
preg_match_all($pattern, $message, $match, PREG_OFFSET_CAPTURE);
var_dump($match);

Expected result:
----------------
array(1) {
  [0]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(6) "Wasser"
      [1]=>
      int(61)
    }
  }
}
array(1) {
  [0]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(7) " Wasser"
      [1]=>
      int(60)
    }
  }
}

Actual result:
--------------
array(1) {
  [0]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      string(6) "wasser"
      [1]=>
      int(17)
    }
    [1]=>
    array(2) {
      [0]=>
      string(6) "Wasser"
      [1]=>
      int(61)
    }
  }
}
array(1) {
  [0]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      string(8) "ßwasser"
      [1]=>
      int(15)
    }
    [1]=>
    array(2) {
      [0]=>
      string(7) " Wasser"
      [1]=>
      int(60)
    }
  }
}


------------------------------------------------------------------------


