I've been thinking about how to not force UTF-8 in PCRE for PHP 6, and
it's not that simple. This is mainly due to preg_replace(), because it
allows array() parameters that can contain mixed IS_UNICODE and IS_STRING
values. I hope you realize, though, that in UTF-8 mode PCRE does not care
about POSIX locales, even in PHP 5.

I haven't thought about that, but can't you simply reject mixing Unicode and
binary strings?

By the way, I think the ICU regexp extension, when implemented, will let you
match Portuguese characters in UTF-8 strings.

I wasn't aware of that API. Anyway, it is probably slower than pcre+locales
(because it uses Unicode property table lookups).

Yes, UTF-8 covers many aspects, but does it know about words, whitespace
(not sure if whitespace characters are always the same) and other
locale-specific issues? I mean in general, not only in PCRE. Maybe it is
more something for ICU directly, as you said later in this thread.

That's not really a problem with PCRE, as it supports Unicode character
properties. It isn't documented in phpdoc (don't look at me :P), but it
looks like:
\pL
where L is one of (from http://pcre.org/pcre.txt):
  L    Letter
  Ll   Lower case letter
  N    Number
  Nd   Decimal number
  Nl   Letter number
  No   Other number
  P    Punctuation
  Zs   Space separator
(...)
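
For what it's worth, here is a minimal sketch of how these properties could be
used from PHP. It assumes PCRE was built with UTF-8 and Unicode property
support, that the /u modifier is used, and that the script and its strings are
UTF-8 encoded. The sample string and variable names are made up for
illustration. Note that a single-letter property can be written as \pL, while
a two-letter one needs braces, e.g. \p{Ll}.

```php
<?php
// Sketch only: assumes PCRE has UTF-8 and Unicode property support,
// and that this file is saved as UTF-8. $subject is a made-up sample.
$subject = "coração 123 ação";

// \p{L}+ matches runs of letters, including accented Portuguese ones.
preg_match_all('/\p{L}+/u', $subject, $words);

// \p{Nd}+ matches runs of decimal digits.
preg_match_all('/\p{Nd}+/u', $subject, $numbers);

// \p{Zs}+ matches runs of space separators, so it can be used to split.
$parts = preg_split('/\p{Zs}+/u', $subject);

print_r($words[0]);   // the two words: coração, ação
print_r($numbers[0]); // the single number: 123
print_r($parts);      // coração, 123, ação
```

None of this depends on the POSIX locale, since the property lookups come from
PCRE's own Unicode tables.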
Nuno
--
PHP Internals - PHP Runtime Development Mailing List