Hi Internals!

First, let me introduce myself: my name is Rouven Weßling, I'm a student at RWTH Aachen University, and I'm one of the maintainers of the Joomla! Framework (née Platform). I've been following the internals list for a few months and have spent the past couple of months brushing up on my C skills so that I can start contributing.
To me, one of the most annoying things about working with PHP is the (lack of) Unicode support. In Joomla! we've been discussing switching from PHP UTF-8 to Patchwork UTF-8 for our UTF-8 handling. Both are libraries that abstract the multibyte extension and supplement it with a number of functions. They also provide userland replacements for when multibyte is not available (Patchwork will also use iconv and intl if available). All of this is a huge pain.

To ease this situation, I'd like to make a new start at better Unicode support for PHP, this time focusing on UTF-8 as the dominant web encoding. As a first step, I'd like to propose adding a set of functions for handling UTF-8 strings. This should keep applications from having to implement these algorithms in PHP (many of the native versions are also quite a bit faster, see benchmark results below). Once the algorithms are in place, I'd like to look into creating a class for Unicode strings and, eventually, Python-like Unicode literals.

Before I write an RFC, I'd like to get some feedback on adding the following functions to PHP 5.6 (possibly more to follow):

utf8_is_valid, utf8_strlen, utf8_substr, utf8_strpos, utf8_strrpos, utf8_str_split, utf8_strrev, utf8_recover, utf8_chr, utf8_ord, string_is_ascii

Most of them (the exceptions are utf8_chr, utf8_is_valid, utf8_recover and string_is_ascii) are currently written so that they emit a warning and return null when they encounter invalid UTF-8. This should encourage applications to check their input with utf8_is_valid and either stop further processing or fall back to utf8_recover to get a valid string. This should improve security, since there are attack vectors that rely on malformed sequences being interpreted as another encoding.
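To make the validation step a bit more concrete, here is a minimal sketch in C of the kind of check a function like utf8_is_valid could perform. It is not taken from the pecl-utf8 code; the name sketch_utf8_is_valid and its signature are made up for illustration, and it simply follows the well-formedness rules of RFC 3629 (no overlong encodings, no surrogates, nothing above U+10FFFF).

```c
#include <stddef.h>

/*
 * Hypothetical helper, NOT the pecl-utf8 implementation.
 * Returns 1 if the first len bytes of s are well-formed UTF-8
 * according to RFC 3629, and 0 otherwise.
 */
int sketch_utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t need;        /* continuation bytes expected after c */
        unsigned long cp;   /* code point decoded so far */
        unsigned long min;  /* smallest code point allowed at this length */

        if (c < 0x80) {                    /* 0xxxxxxx: plain ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {   /* 110xxxxx: 2-byte sequence */
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {   /* 1110xxxx: 3-byte sequence */
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {   /* 11110xxx: 4-byte sequence */
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;   /* stray continuation byte or invalid lead byte */
        }

        /* i < len holds here, so len - i - 1 cannot underflow. */
        if (len - i - 1 < need) {
            return 0;   /* sequence is truncated */
        }

        for (size_t k = 1; k <= need; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80) {
                return 0;   /* not a 10xxxxxx continuation byte */
            }
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min) {
            return 0;   /* overlong encoding, e.g. C0 AF for '/' */
        }
        if (cp > 0x10FFFF) {
            return 0;   /* beyond the Unicode code space */
        }
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;   /* UTF-16 surrogates are not valid in UTF-8 */
        }

        i += need + 1;
    }

    return 1;
}
```

The overlong check is the part that matters most for the security argument above: a byte pair such as C0 AF decodes to '/' in a lenient decoder, so rejecting it up front keeps such input away from the other functions.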
You can find the code I've written so far here:
https://github.com/realityking/pecl-utf8

You can find benchmark results here:
http://realityking.github.io/pecl-utf8/results.html

Best regards
Rouven