[PHP-DEV] ICU and Locale/Collations

Derick Rethans Wed, 31 Aug 2005 08:01:33 -0700

Hello!

I've been looking at how we can use locales and collation now that we have
ICU. Please let me know if you have any comments:



Locale Functions with Unicode
=============================

Introduction
------------
The Unicode design document states that the current functions that can make use
of a locale (such as strtoupper) will not be implemented in a locale-aware
way. Although this works in most situations, it might break BC in a few of
them. One (popular) example is: ::

        <?php
                setlocale(LC_ALL, 'tr_TR');
                echo strtoupper('hans blix'), "\n";
        ?>

In PHP 4 and 5 this returns (when viewing in iso-8859-9): ::

        HANS BLİX
        
In PHP 6, however, this currently returns: ::

        HANS BLIX

The string returned for PHP 4 and 5 is the correct one for Turkish.
See also note 1.


Locale Dependent Functions
--------------------------
Other functions also depend on the locale settings, some in different ways.
The following lists these functions and how they use the system locale.

Array Sorting Functions
~~~~~~~~~~~~~~~~~~~~~~~
All the array sorting functions accept a flag "SORT_LOCALE_STRING" that changes
the sorting of array keys/values from a binary compare to a locale-based
compare. This uses the function strcoll(), which relies on the system's locale.
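
As a minimal sketch of the difference, the following compares a binary sort
with a SORT_LOCALE_STRING sort. The locale names passed to setlocale() are
assumptions; which of them exist depends on the system, so the locale-based
result is printed rather than stated: ::

        <?php
                $words = array('zebra', 'Apple', 'apple', 'Zoo');

                // Binary compare: plain byte order, so upper case letters
                // sort before lower case ones.
                $binary = $words;
                sort($binary, SORT_STRING);
                print_r($binary);       // Apple, Zoo, apple, zebra

                // Locale-based compare: goes through strcoll() and the
                // LC_COLLATE category of the system locale. setlocale()
                // tries the names in order and returns the first one it
                // could activate; "C" is listed last as a fallback, and
                // with "C" the result equals the binary one.
                $locale = setlocale(LC_COLLATE, 'en_US.UTF-8', 'en_US', 'C');
                echo "Collating with: $locale\n";

                $byLocale = $words;
                sort($byLocale, SORT_LOCALE_STRING);
                print_r($byLocale);
        ?>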

String Functions
~~~~~~~~~~~~~~~~
str_word_count
        Uses the system locale to determine which characters make up a word.

strnatcasecmp, strnatcmp
        Use the locale to upper and lower case letters, and to determine if
        something is a digit or not.

strcmp, strncmp
        Currently do not use any locale, but perhaps they could make use of
        one, e.g. in the ß vs ss case.

strcasecmp, strncasecmp
        Use the system locale to lowercase letters so that they can be
        matched case-insensitively. See also note 2.

strtolower, strtoupper
        Both make use of the locale's character properties to lower/upper
        case characters properly.

ucfirst, ucwords
        Use character properties to upper and lower case the first letters of
        words.

Other Functions
~~~~~~~~~~~~~~~
localeconv
        Uses the system locale to return information about this locale.

money_format
        Uses the system locale to format a number as a monetary value.


Problems with System Locales
----------------------------
There are a number of problems with having to rely on the locale information
that is available on different platforms / installations.  Locale information:

- can be different for each platform
- might not be available, depending on platform and installation
- does not have a common identifier on different platforms
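
The identifier problem can be made concrete with setlocale(), which accepts
several candidate names and returns false when none of them is installed. The
names below are only examples of spellings that might be used on different
platforms, not a verified list: ::

        <?php
                // Candidate spellings for "Turkish" (illustrative, assumed
                // names).
                $candidates = array(
                        'tr_TR.UTF-8', 'tr_TR.ISO8859-9', 'tr_TR', 'turkish'
                );

                // setlocale() tries each name in turn and returns the first
                // one it could activate, or false if the system knows none.
                $active = setlocale(LC_CTYPE, $candidates);

                if ($active === false) {
                        echo "No Turkish locale installed on this system\n";
                } else {
                        echo "Using system locale: $active\n";
                }
        ?>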


ICU Locales and Collators
-------------------------
As ICU provides us with a platform and installation independent way of dealing
with locales and collation rules, we can use this to get rid of the current
dependency on system locales. There are three ways in which we can upgrade
our functions to use ICU locales:

1. We simply make them use the default locale, as set by icu_loc_set_default()
   and default collator (as set by a future icu_coll_set_default()).
2. We add a new parameter to the functions specifying which locale to use.
3. Create new functions that are locale and collation dependent (by using the
   default locale/collation).

Each of these three options has pros and cons.
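
To make the three call styles concrete, here is a toy sketch. Everything in it
is hypothetical: the names demo_set_default_locale(), demo_upper(),
demo_upper_in() and demo_i18n_upper() are made up for illustration, and the
two-entry mapping table merely stands in for ICU's real case mapping. It
assumes the file is saved as UTF-8: ::

        <?php
                // Stand-in for ICU case mapping: handles only the Turkish
                // dotted and dotless i, then upper cases ASCII letters.
                function demo_case_map($str, $locale)
                {
                        if (strncmp($locale, 'tr', 2) === 0) {
                                $str = strtr($str, array('i' => 'İ', 'ı' => 'I'));
                        }
                        return strtr($str,
                                'abcdefghijklmnopqrstuvwxyz',
                                'ABCDEFGHIJKLMNOPQRSTUVWXYZ');
                }

                // Option 1: the existing function consults a global default
                // locale.
                $GLOBALS['demo_default_locale'] = 'C';

                function demo_set_default_locale($locale)
                {
                        $GLOBALS['demo_default_locale'] = $locale;
                }

                function demo_upper($str)
                {
                        return demo_case_map($str, $GLOBALS['demo_default_locale']);
                }

                // Option 2: the existing function grows a locale parameter.
                function demo_upper_in($str, $locale = 'C')
                {
                        return demo_case_map($str, $locale);
                }

                // Option 3: a new, prefixed function uses the default
                // locale; the old function (not shown) stays as it was.
                function demo_i18n_upper($str)
                {
                        return demo_case_map($str, $GLOBALS['demo_default_locale']);
                }

                echo demo_upper('hans blix'), "\n";              // HANS BLIX
                demo_set_default_locale('tr_TR');
                echo demo_upper('hans blix'), "\n";              // HANS BLİX
                echo demo_upper_in('hans blix', 'tr_TR'), "\n";  // HANS BLİX
                echo demo_i18n_upper('hans blix'), "\n";         // HANS BLİX
        ?>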

Modifying Functions to Use ICU Locales
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pro:

- No additional programming needed by users as the current functions would "just
  work like expected". For people that do not care about locales, nothing will
  really change, as the current default locale should be "C" or "POSIX".
- No ugly API for our string handling functions.

con:

- It might break BC in some cases.

Adding a New Argument to Functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pro:

- Doesn't break BC

con:

- Additional work for programmers for every function call.
- Ugly API because of the passing of the locale name.

Create New Functions
~~~~~~~~~~~~~~~~~~~~
pro:

- Doesn't break BC
- No ugly API

con:

- Additional work for programmers as they need to replace the current functions
  with the upgraded ones.
- It is crucial for portability that the new functions cannot be disabled.
- We need to come up with a good prefix for those.
- The new functions need to work when Unicode semantics are turned off.


Discussion
----------
In my opinion both the first and third options are acceptable, but I prefer
the first one, as it gives users as little headache as possible when they start
using locales. This approach would work well for the String Functions.

For the array sorting functions, I would prefer that the current
"SORT_LOCALE_STRING" simply starts using the ICU collation functionality, as
it's a relatively new flag. Another solution would be to create a new flag for
this, "SORT_ICU_LOCALE_STRING", that makes the sorting functions use the
collation functionality provided by ICU.
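
For comparison, this is roughly what ICU-based sorting looks like with the
Collator class of PHP's later intl extension. That extension is an assumption
here and not the mechanism proposed in this mail; the exact order depends on
the ICU collation data, so the sorted list is printed rather than stated: ::

        <?php
                $names = array('Zebra', 'Äpfel', 'Apfel', 'zebra');

                // A collator carries its own locale; the system locale is
                // not consulted.
                $collator = new Collator('de_DE');

                // Sorts in place and returns true on success.
                $sorted = $names;
                $collator->sort($sorted);
                print_r($sorted);

                // The same collator can compare two strings directly.
                var_dump($collator->compare('Apfel', 'Zebra') < 0); // bool(true)
        ?>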

For the Other Functions we should create a new function to format numbers in a
locale-aware way, as it would be very hard to make the current money_format
compatible with ICU and still expose the full capabilities of ICU's number
formatting functionality.


Other Functions' Implementation
-------------------------------
i18n_format_number($number, $type [, $custom_format])
        A wrapper around ICU's unum.h C-API
        (http://icu.sourceforge.net/apiref/icu4c/unum_8h.html) that allows you
        to format numbers in locale-specific ways.

i18n_parse_number($number, $type [, $custom_format])
        A wrapper around the number parsing routines from unum.h
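
As a rough idea of what such wrappers could offer, the sketch below uses the
NumberFormatter class of PHP's later intl extension, which wraps the same
unum.h API. It is an assumption for illustration, not the proposed
i18n_format_number() and i18n_parse_number(); the formatted strings depend on
the ICU data, so they are printed rather than stated: ::

        <?php
                $number = 1234567.891;

                // Locale-specific decimal formatting.
                $decimal = new NumberFormatter('de_DE', NumberFormatter::DECIMAL);
                $text = $decimal->format($number);
                echo $text, "\n";

                // Locale-specific currency formatting.
                $currency = new NumberFormatter('de_DE', NumberFormatter::CURRENCY);
                echo $currency->formatCurrency($number, 'EUR'), "\n";

                // Parsing turns the formatted text back into a number.
                var_dump($decimal->parse($text));
        ?>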


Notes:
------
1. For some reason, in PHP 6, the strtoupper() function *does* make use of the
   locale though:

   By setting the locale with icu_loc_set_default("tr_TR") the PHP 6 example
   gives the correct result: ::

                <?php
                        icu_loc_set_default("tr_TR");
                        echo strtoupper('hans blix'), "\n";
                ?>

   Shows: ::

                HANS BLİX

2. The function zend_u_binary_strncmp doesn't compare anything binary, as it
   uses U16_NEXT. Why do we still call it u_binary_strncmp?


regards,
Derick

-- 
Derick Rethans
http://derickrethans.nl | http://ez.no | http://xdebug.org

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
