Edit report at https://bugs.php.net/bug.php?id=37391&edit=1
ID: 37391
Comment by: harald dot lapp at gmail dot com
Reported by: mike at silverorange dot com
Summary: PREG_OFFSET_CAPTURE not UTF-8 aware when using u modifier
Status: Not a bug
Type: Bug
Package: PCRE related
Operating System: Linux
PHP Version: 5.1.4
Block user comment: N
Private report: N
New Comment:
I am not sure where the manual mentions that PREG_OFFSET_CAPTURE is not UTF-8
aware. And even if it does, this behaviour is still very annoying. Is there any
chance that it could be changed?
Previous Comments:
------------------------------------------------------------------------
[2006-05-10 07:03:42] [email protected]
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php
------------------------------------------------------------------------
[2006-05-09 22:57:49] mike at silverorange dot com
Description:
------------
When using preg_match_all() with the PREG_OFFSET_CAPTURE flag, the returned
match offsets are in octets rather than characters.
PCRE is compiled with --enable-utf8 and I am using the u modifier in my regular
expression.
Reproduce code:
---------------
<?php
$matches = array();
$reg_exp = "/B/u";
// UTF-8 encoding of A, the euro sign (3 bytes: e2 82 ac), B, C
$string = "A\xe2\x82\xacBC";
preg_match_all($reg_exp, $string, $matches, PREG_OFFSET_CAPTURE);
print_r($matches);
?>
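To illustrate why the two counts differ for the string above, here is a minimal sketch comparing PHP's byte-based and character-based string functions. It assumes the mbstring extension is loaded. The variable names are made up for illustration and are not part of the original report.

```php
<?php
// Same string as in the reproduce code: A, euro sign (3 bytes), B, C.
$string = "A\xe2\x82\xacBC";

// Byte-oriented functions count octets.
$byteLength = strlen($string);                  // 1 + 3 + 1 + 1 = 6
$bytePos    = strpos($string, 'B');             // 4

// mbstring functions count characters (assumes mbstring is available).
$charLength = mb_strlen($string, 'UTF-8');      // 4
$charPos    = mb_strpos($string, 'B', 0, 'UTF-8'); // 2

echo "bytes: length $byteLength, B at $bytePos\n";
echo "chars: length $charLength, B at $charPos\n";
```

The byte position of B (4) is the value reported by PREG_OFFSET_CAPTURE in this report. The character position (2) is the value the reporter expected.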
Expected result:
----------------
Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => B
                    [1] => 2
                )
        )
)
Actual result:
--------------
Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => B
                    [1] => 4
                )
        )
)
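For readers who need character offsets today, one possible workaround is to convert each reported byte offset after the match. The sketch below is not an official fix. The helper name byte_offset_to_char_offset is hypothetical, and it assumes the subject is valid UTF-8 and that the mbstring extension is loaded.

```php
<?php
// Hypothetical helper (not part of PHP): converts a byte offset reported by
// PREG_OFFSET_CAPTURE into a character offset. Assumes $subject is valid
// UTF-8 and that the mbstring extension is available.
function byte_offset_to_char_offset($subject, $byteOffset)
{
    // Take the bytes that precede the match and count their characters.
    return mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');
}

$string = "A\xe2\x82\xacBC";
$matches = array();
preg_match_all('/B/u', $string, $matches, PREG_OFFSET_CAPTURE);

foreach ($matches[0] as $match) {
    $text       = $match[0];
    $byteOffset = $match[1];
    $charOffset = byte_offset_to_char_offset($string, $byteOffset);
    echo "$text: byte offset $byteOffset, character offset $charOffset\n";
}
```

The conversion works because a match reported under the u modifier starts on a character boundary, so the prefix cut by substr() is itself valid UTF-8. Note that each conversion rescans the prefix, which may matter for long subjects with many matches.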
------------------------------------------------------------------------