Hello! I've been looking at how we can use locales and collation now that we have ICU. Please let me know if you have any comments:

Locale Functions with Unicode
=============================

Introduction
------------

The Unicode design document states that none of the current functions that
can make use of a locale (such as strtoupper) are going to be implemented in
a locale-aware way. Although this will work for most situations, it might
break BC in a few. One (popular) example is:

::

    <?php
    setlocale(LC_ALL, 'tr_TR');
    echo strtoupper('hans blix'), "\n";
    ?>

In PHP 4 and 5 this returns (when viewed as iso-8859-9):

::

    HANS BLİX

Whereas in PHP 6 this currently returns:

::

    HANS BLIX

The string returned by PHP 4 and 5 is the correct one for Turkish. See also
note 1.

Locale Dependent Functions
--------------------------

There are other functions that deal with the locale settings, some in a
different way. Below is a list of these functions and how they use the system
locale.

Array Sorting Functions
~~~~~~~~~~~~~~~~~~~~~~~

All the array sorting functions accept a flag "SORT_LOCALE_STRING" that
changes the sorting of array keys/values from a binary compare to a
locale-based compare. This uses the function strcoll(), which relies on the
system's locale.

String Functions
~~~~~~~~~~~~~~~~

str_word_count
    Uses the system locale to determine which characters make up a word.

strnatcasecmp, strnatcmp
    Use the locale to upper and lower case letters, and to determine whether
    something is a digit or not.

strcmp, strncmp
    Currently do not use any locale, but perhaps they could make use of it,
    for example in the ß vs ss case.

strcasecmp, strncasecmp
    Use the system locale to lower case letters so that they can be matched
    case-insensitively. See also note 2.

strtolower, strtoupper
    Both make use of locale properties for characters to lower/upper case
    them properly.

ucfirst, ucwords
    Use character properties to upper and lower case the first letters of
    words.

Other Functions
~~~~~~~~~~~~~~~

localeconv
    Uses the system locale to return information about this locale.

money_format
    Uses the system locale to format a number as a monetary amount.
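
To make the difference between a binary and a locale-based sort concrete, here
is a minimal sketch. The word list and the choice of "de_DE.UTF-8" are
assumptions made purely for illustration; whether that locale exists depends
on the platform, which is why the sketch falls back to "C" when setlocale()
returns false.

```php
<?php
// Hypothetical word list, used only for illustration.
$words = ['zebra', 'Apple', 'apple', 'Zoo'];

// Try a named system locale. setlocale() returns false when the platform
// does not provide it, so fall back to the portable "C" locale.
$active = setlocale(LC_COLLATE, 'de_DE.UTF-8');
if ($active === false) {
    $active = setlocale(LC_COLLATE, 'C');
}

$binary = $words;
sort($binary, SORT_STRING);            // plain byte-wise comparison

$localized = $words;
sort($localized, SORT_LOCALE_STRING);  // comparison based on the current locale

echo "Collation locale: $active\n";
echo 'SORT_STRING:        ', implode(', ', $binary), "\n";
echo 'SORT_LOCALE_STRING: ', implode(', ', $localized), "\n";
```

With the "C" fallback both lines show the same order, because collation in
the "C" locale is equivalent to a byte-wise compare. With a real locale the
second line typically differs, and what it shows depends on the locale data
installed on that system.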

Problems with System Locales
----------------------------

There are a number of problems with having to rely on the locale information
that is available on different platforms / installations. Locale information:

- can be different for each platform
- might not be available, depending on platform and installation
- does not have a common identifier on different platforms

ICU Locales and Collators
-------------------------

As ICU provides us with a platform and installation independent way of
dealing with locales and collation rules, we can use this to get rid of the
current dependency on system locales. There are three ways in which we can
upgrade our functions to use ICU locales:

1. We simply make them use the default locale, as set by
   icu_loc_set_default(), and the default collator (as set by a future
   icu_coll_set_default()).
2. We add a new parameter to the functions specifying which locale to use.
3. We create new functions that are locale and collation dependent (by using
   the default locale/collation).

Each of these three options has pros and cons.

Modifying Functions to Use ICU Locales
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

pro:

- No additional programming needed by users, as the current functions would
  "just work as expected". For people who do not care about locales, nothing
  will really change, as the current default locale should be "C" or "POSIX".
- No ugly API for our string handling functions.

con:

- It might break BC in some cases.

Adding a New Argument to Functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

pro:

- Doesn't break BC

con:

- Additional work for programmers on every function call.
- Ugly API because of the passing of the locale name.

Create New Functions
~~~~~~~~~~~~~~~~~~~~

pro:

- Doesn't break BC
- No ugly API

con:

- Additional work for programmers, as they need to replace the current
  functions with the upgraded ones.
- It is crucial that the new functions cannot be disabled, because of
  portability.
- We need to come up with a good prefix for those.
- The new functions need to work when Unicode semantics are turned off.

Discussion
----------

Both the first and third options would in my opinion be acceptable, though I
would prefer the first one, as it causes users as little headache as possible
when they start using locales. This approach would work well for the String
Functions.

For the array sorting functions, I would prefer that the current
"SORT_LOCALE_STRING" simply starts using the ICU collation functionality, as
it's a relatively new flag. Another solution would be to create a new flag
for this, "SORT_ICU_LOCALE_STRING", that makes the sorting functions use the
collation functionality provided by ICU.

For the Other Functions we should create a new function to format numbers in
a locale-aware way, as it would be very hard to make the current money_format
compatible with ICU and still offer the full possibilities of ICU's number
formatting functionality.

Other Functions' Implementation
-------------------------------

i18n_format_number($number, $type [, $custom_format])
    A wrapper around ICU's unum.h C-API
    (http://icu.sourceforge.net/apiref/icu4c/unum_8h.html) that allows you to
    format numbers in locale-specific ways.

i18n_parse_number($number, $type [, $custom_format])
    A wrapper around the number parsing routines from unum.h

Notes:
------

1. For some reason, in PHP 6, the strtoupper() function *does* make use of
   the locale after all. When the locale is set with
   icu_loc_set_default("tr_TR"), the PHP 6 example gives the correct result:

   ::

       <?php
       icu_loc_set_default("tr_TR");
       echo strtoupper('hans blix'), "\n";
       ?>

   Shows:

   ::

       HANS BLİX

2. The function zend_u_binary_strncmp doesn't compare anything binary, as it
   uses U16_NEXT. Why do we still call it u_binary_strncmp?

regards,
Derick

-- 
Derick Rethans
http://derickrethans.nl | http://ez.no | http://xdebug.org

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php