Re: [PHP] preg_replace with UTF-8

SleePy Mon, 06 Jul 2009 12:08:18 -0700

Thank you Andrew,
That seems to break up UTF-8 strings. So from there I will play with it.


On Jul 6, 2009, at 8:50 AM, Andrew Ballard wrote:

On Sun, Jul 5, 2009 at 9:54 PM, SleePy <sleepingkil...@gmail.com> wrote:

I seem to be having a minor issue with preg_replace not working as
expected when using UTF-8 strings. So far I have found out that \w
doesn't seem to be detecting UTF-8 strings.

This is my test php file:
<?php
$data = 'ooooooooooooooooooooooo';
echo 'Data before: ', $data, '<br />';

$data = preg_replace('~([\w\.]{6})~u', '$1 < >', $data);
echo 'Data After: ', $data;

// UTF-8 Test
$data = 'ффффффффффффффффффффффф';
echo '<hr />Data before: ', $data, '<br />';

$data = preg_replace('~([\w\.]{6})~u', '$1 < >', $data);
echo 'Data After: ', $data;

?>


I would expect it to be:
Data before: ooooooooooooooooooooooo
Data After: oooooo < >oooooo < >oooooo < >ooooo
---
Data before: ффффффффффффффффффффффф
Data After: фффффф < >фффффф < >фффффф < >ффффф

But what I get is:
Data before: ooooooooooooooooooooooo
Data After: oooooo < >oooooo < >oooooo < >ooooo
---
Data before: ффффффффффффффффффффффф
Data After: ффффффффффффффффффффффф

Did I go about this the wrong way or is this a php bug itself?

I tested this in php 5.3, 5.2.9 and 6.0 (snapshot from a couple weeks ago)
and received the same results.


--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php


From the manual on PCRE syntax:
"A 'word' character is any letter or digit or the underscore
character, that is, any character which can be part of a Perl 'word'.
The definition of letters and digits is controlled by PCRE's character
tables, and may vary if locale-specific matching is taking place. For
example, in the 'fr' (French) locale, some character codes greater
than 128 are used for accented letters, and these are matched by \w.

These character type sequences can appear both inside and outside
character classes. They each match one character of the appropriate
type. If the current matching point is at the end of the subject
string, all of them fail, since there is no character to match."

I'm not sure if this is exactly what you want (or if it might let more
things slip past than you intend), but try this:

<?php
$data = 'ooooooooooooooooooooooo';
echo 'Data before: ', $data, '<br />';

$data = preg_replace('~([\w\pL\.]{6})~u', '$1 < >', $data);
echo 'Data After: ', $data;

// UTF-8 Test
$data = 'ффффффффффффффффффффффф';
echo '<hr />Data before: ', $data, '<br />';

$data = preg_replace('~([\w\pL\.]{6})~u', '$1 < >', $data);
echo 'Data After: ', $data;

?>
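
For what it's worth, here is a minimal sketch of the same idea with the
Unicode classes spelled out, so it does not depend on how \w is defined
by the locale or the PCRE build. It assumes PCRE was compiled with
Unicode property support (otherwise \p{L} is rejected). The function
name insert_marker_every() and its parameters are made up for
illustration and are not part of PHP or of the code above.

```php
<?php
// Hypothetical helper: appends $marker after every run of $length
// letters, digits, underscores or dots.
function insert_marker_every($text, $length = 6, $marker = ' < >')
{
    // \p{L} matches any Unicode letter, \p{N} any Unicode number.
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    $pattern = '~[\p{L}\p{N}_.]{' . (int) $length . '}~u';

    // A callback keeps $marker literal, so a "$1" or "\" inside it
    // is not interpreted as a backreference.
    $result = preg_replace_callback(
        $pattern,
        function ($m) use ($marker) {
            return $m[0] . $marker;
        },
        $text
    );

    // preg_replace_callback() returns null on error (for example,
    // when the subject is not valid UTF-8); fall back to the input.
    return $result === null ? $text : $result;
}

echo insert_marker_every(str_repeat('o', 23)), "\n";
echo insert_marker_every(str_repeat('ф', 23)), "\n";
```

Listing \p{L} and \p{N} explicitly may accept more than you intend
(any script's letters and digits), so narrow the class if you only
want specific alphabets.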

Andrew


