I've been thinking about how to not force UTF-8 in PCRE for PHP 6, and it's not that simple. This is mainly due to preg_replace(), because it allows array() parameters that can contain mixed IS_UNICODE and IS_STRING values. I hope you realize though, that in UTF-8 mode PCRE does not care about POSIX locales, even in PHP 5.
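
To make the array case concrete, here is a minimal sketch of preg_replace() taking array() parameters where only one of the patterns asks for UTF-8 mode. It uses the per-pattern "u" modifier as it exists in PHP 5, not the IS_UNICODE/IS_STRING types discussed above; the subject string and the replacements are made up for illustration.

```php
<?php
// Sketch only: array patterns in preg_replace(), with the "u" (UTF-8)
// modifier set on one pattern and not on the other.
$subject = "caf\xC3\xA9 costs 12 euros"; // "café" written as UTF-8 bytes

$patterns = array(
    '/\d+/',        // byte-oriented pattern, no UTF-8 mode
    '/caf\x{E9}/u', // UTF-8 mode: \x{E9} is the code point U+00E9
);
$replacements = array('N', 'coffee');

// Patterns are applied in order, each one to the result of the previous one.
$result = preg_replace($patterns, $replacements, $subject);
echo $result, "\n";
```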

I haven't thought about that, but can't you simply reject mixing Unicode and binary strings?


By the way, I think the ICU regexp extension, when implemented, will let you match Portuguese characters in UTF-8 strings.

I wasn't aware of that API. Anyway, it is probably slower than PCRE with locales, because it relies on Unicode property table lookups.


Yes, UTF-8 covers many aspects, but does it know about words,
whitespace (not sure if whitespace characters are always the same) and
other locale-specific issues? I mean in general, not only in PCRE. Maybe
it is more something for ICU directly, as you said later in this thread.

That's not really a problem with PCRE, as it supports Unicode character properties. It isn't documented in phpdoc (don't look at me :P), but it looks like:
\pL
(or \p{Ll}, with braces, for the two-letter names), where the property is one of (from http://pcre.org/pcre.txt):
        L     Letter
        Ll    Lower case letter
        N     Number
        Nd    Decimal number
        Nl    Letter number
        No    Other number
        P     Punctuation
        Zs    Space separator
(...)
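
A rough sketch of how these properties can be used from PHP, assuming a PCRE build with Unicode property support and the "u" modifier so that the subject is treated as UTF-8. The sample string and variable names are only for illustration.

```php
<?php
// Sketch: Unicode property escapes with the "u" (UTF-8) modifier.
$text = "Ol\xC3\xA1 mundo 42"; // "Olá mundo 42" written as UTF-8 bytes

// \p{L}+ matches runs of letters, including the accented "á".
preg_match_all('/\p{L}+/u', $text, $words);

// \p{Nd}+ matches runs of decimal digits.
preg_match_all('/\p{Nd}+/u', $text, $numbers);

// \p{Zs}+ splits on runs of space separators.
$parts = preg_split('/\p{Zs}+/u', $text);

print_r($words[0]);   // the letter runs
print_r($numbers[0]); // the digit runs
print_r($parts);      // the pieces between spaces
```

No locale is involved here: the character classes come from the Unicode property tables, so the Portuguese letters match regardless of the current setlocale() value.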


Nuno
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
