[PHP-DEV] Proposal for better UTF-8 handling

Rouven Weßling Thu, 23 May 2013 18:18:33 -0700

Hi Internals!

First, let me introduce myself: my name is Rouven Weßling, I'm a student at RWTH 
Aachen University, and I'm one of the maintainers of the Joomla! Framework (née 
Platform). I've been following the internals list for a few months and have been 
brushing up on my C skills over the past couple of months so I can start 
contributing.


To me, one of the most annoying things about working with PHP is the (lack of) 
Unicode support. In Joomla! we've been discussing switching from PHP UTF-8 to 
Patchwork UTF-8 for our UTF-8 handling. Both are libraries that abstract 
the multibyte extension and supplement it with a number of functions. They 
also provide userland replacements for when multibyte is not available 
(Patchwork will also use iconv and intl if available). All of this is a huge 
pain.

To ease this situation, I'd like to make a new start at better Unicode support 
for PHP, this time focusing on UTF-8 as the dominant web encoding. As a first 
step, I'd like to propose adding a set of functions for handling UTF-8 strings. 
This should keep applications from having to implement these algorithms in PHP 
(many of the native versions are also quite a bit faster; see the benchmark 
results below). Once the algorithms are in place, I'd like to look into creating 
a class for Unicode strings and, eventually, Python-like Unicode literals.

Before I write an RFC, I'd like to get some feedback on what you think about 
adding the following functions to PHP 5.6 (possibly more to follow): 
utf8_is_valid, utf8_strlen, utf8_substr, utf8_strpos, utf8_strrpos, 
utf8_str_split, utf8_strrev, utf8_recover, utf8_chr, utf8_ord, string_is_ascii.
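
To give a feel for the kind of algorithm involved, here is a minimal sketch in C 
of counting code points, roughly what a function like utf8_strlen has to do. 
The name sketch_utf8_strlen is hypothetical and this is not the code from the 
linked repository; it assumes the input has already been checked and is valid 
UTF-8, and it does no error reporting.

```c
#include <stddef.h>

/* Hypothetical helper for illustration only (not the pecl-utf8 code).
 * Counts code points in the len bytes at s, ASSUMING s is valid UTF-8.
 * Every code point has exactly one byte that is not a continuation byte
 * (continuation bytes look like 10xxxxxx), so we count those. */
size_t sketch_utf8_strlen(const unsigned char *s, size_t len)
{
    size_t count = 0;

    for (size_t i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For example, "Weßling" encoded as UTF-8 is 8 bytes long but contains 7 code 
points, which is the difference between strlen and a UTF-8 aware length.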

Most of them (the exceptions are utf8_chr, utf8_is_valid, utf8_recover and 
string_is_ascii) are currently written so that they emit a warning when 
they encounter invalid UTF-8 and return null. This should encourage 
applications to check their input with utf8_is_valid and either stop further 
processing or fall back to utf8_recover to get a valid string. This should 
improve security, since there are attack vectors in which malformed sequences 
get interpreted as another encoding.
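
As an illustration of what such a validity check involves, here is a minimal 
sketch in C that follows the well-formedness rules of RFC 3629. The function 
name sketch_utf8_is_valid is hypothetical and the code is not taken from the 
linked repository; the actual utf8_is_valid may be implemented differently.

```c
#include <stddef.h>

/* Hypothetical helper for illustration only (not the pecl-utf8 code).
 * Returns 1 if the len bytes at s are well-formed UTF-8, otherwise 0. */
int sketch_utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t need;        /* continuation bytes expected after c */
        unsigned long cp;   /* code point decoded so far */
        unsigned long min;  /* smallest code point allowed at this length */

        if (c < 0x80) {     /* plain ASCII byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;       /* stray continuation byte or invalid lead byte */
        }

        if (len - i - 1 < need) {
            return 0;       /* sequence is cut off at the end of the input */
        }
        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) {
                return 0;   /* expected a 10xxxxxx continuation byte */
            }
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        if (cp < min) return 0;                      /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* UTF-16 surrogate */
        if (cp > 0x10FFFF) return 0;                 /* beyond Unicode range */

        i += need + 1;
    }
    return 1;
}
```

The overlong check is the security-relevant part: the two bytes 0xC0 0xAF are 
an overlong form of "/", which a lenient decoder might accept even though a 
byte-level filter looking for "/" would not have seen it.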

You can find the code I've written so far here: 
https://github.com/realityking/pecl-utf8
You can find benchmark results here: 
http://realityking.github.io/pecl-utf8/results.html

Best regards
Rouven