Hi all,

I'm new to this list, so let me introduce myself: I'm one of the internationalization architects at Yahoo, and have recently started looking into PHP internationalization. I previously worked on Java internationalization at Sun for a few years.

Resource bundles have helped make internationalization of Java applications easy and popular, so I'd like to see a similar capability in PHP. I know gettext is available, but it seems a bit difficult to understand and uses locale-specific encodings instead of Unicode.

The ICU style of resource bundles is Unicode-based, but also seems more complicated than desirable for PHP. It's designed for a statically typed environment and requires compilation, neither of which fits well with PHP.

I'd rather start with Java properties files, the simplest and most widely used form of Java resource files. I'd adapt them to PHP 6 by switching their encoding to UTF-8, adopting heredocs, and simplifying their syntax. I'd drop the secondary fallback mechanism, in which resource bundles can inherit individual resources from other bundles. This is an optimization to reduce the size of bundles at the expense of runtime overhead and additional work in creating the bundles. The additional step of finding common resources and moving them to shared bundles is rarely made in normal localization processes, and the space savings don't matter much for PHP, where bundles remain on the server. Dropping the secondary fallback also means that we can simplify the resource bundle to just an array.

Below is my draft specification for a resource bundle mechanism for PHP. For comparison, the specs for the corresponding functionality in Java and ICU4C are:
- ICU resource bundle specification:
    http://icu.sourceforge.net/apiref/icu4c/ures_8h.html
- ICU resource file specification:
    http://dev.icu-project.org/cgi-bin/viewcvs.cgi/icuhtml/design/bnf_rb.txt?view=co
- Java resource bundle specification:
    http://java.sun.com/javase/6/docs/api/java/util/ResourceBundle.html
- Java properties file specification:
    http://java.sun.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader)


API:

array intl_get_resources(string base_name)

Returns an array containing the key-value pairs obtained from a resource file. The resource file is looked up for the current locale using the lookup algorithm of section 3.4 of RFC 4647, at each step generating the file name by concatenating the given base name, the string "-", the language tag of the current step, and the string ".pres". The default file name used if no previous step was successful is the concatenation of base name and ".pres". Once a file is found, its contents are interpreted according to the resource file format specified below, and an array is filled with its key-value pairs. An entry with "#locale#" as its key and the actual locale tag of the file found as its value is added to the array, and the array is returned. The function may cache its results, but must check at least once every 60 minutes that the underlying resource files haven't changed.
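
To make the lookup order concrete, here is a minimal sketch of how the candidate file names could be generated. The function name pres_candidate_files() is hypothetical and not part of the proposal. The sketch only builds the list of names, most specific first; it does not touch the file system, does not normalize the case of language tags, and reflects my reading of the truncation steps in RFC 4647 section 3.4.

```php
<?php
// Hypothetical helper (not part of the proposed API): lists the file names
// to try, most specific first, ending with the default file name.
function pres_candidate_files(string $baseName, string $locale): array
{
    $candidates = [];
    $tag = $locale;
    while ($tag !== '') {
        $candidates[] = $baseName . '-' . $tag . '.pres';
        $pos = strrpos($tag, '-');
        if ($pos === false) {
            break;                      // nothing left to truncate
        }
        $tag = substr($tag, 0, $pos);   // drop the last subtag
        // RFC 4647 lookup: if truncation leaves a single-character subtag
        // (such as "x") at the end, remove that subtag as well.
        if (preg_match('/(^|-)[A-Za-z0-9]$/', $tag)) {
            $pos = strrpos($tag, '-');
            $tag = ($pos === false) ? '' : substr($tag, 0, $pos);
        }
    }
    // Default file name, used when no locale-specific file was found.
    $candidates[] = $baseName . '.pres';
    return $candidates;
}

// Example: the names tried for base name "strings" and locale "ja-JP".
print_r(pres_candidate_files('strings', 'ja-JP'));
```

For "ja-JP" this yields strings-ja-JP.pres, strings-ja.pres, and strings.pres, in that order. An implementation would load the first name in the list that exists as a file.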


Resource File Format

- Files are encoded in UTF-8. The first line may be prefixed with a BOM.
- Lines whose first non-whitespace character is "#" are comment lines and are ignored.
- Lines that contain only whitespace characters and are not part of a heredoc string are ignored.
- Key-value definitions come in two forms:
          o The simple form has a key string, followed by "=", followed by the value, all on one line. The tokens may or may not be surrounded by whitespace characters. Leading and trailing whitespace is trimmed from both key and value. The value cannot start with "<<<"; for values starting with this character sequence, use the heredoc form.
          o The heredoc form starts with a key string, followed by "=", followed by "<<<", followed by an identifier, all on one line. The tokens may or may not be surrounded by whitespace characters. Leading and trailing whitespace is trimmed from both key and value. The heredoc form ends with a termination line that contains only the identifier, possibly followed by a semicolon. The lines between these two lines, except comment lines, form the heredoc string. The line break before the termination line is removed, all other line breaks are preserved.
- Lines that are not comment lines, whitespace lines, or part of a key-value definition are illegal.
- The following escape sequences are recognized in values:
          o "\\" stands for "\"
          o "\n" stands for the newline character, U+000A.
          o "\t" stands for the horizontal tab character, U+0009.
          o "\ " stands for the space character, U+0020. This is only needed if the value of a key-value pair starts or ends with a space character.
          o "\#" stands for the number sign character, U+0023. This is only needed if a line within a heredoc string starts with this character.
- A sequence of "\" followed by a character not listed above is illegal. A "\" immediately preceding the end of the file is illegal.
- Only the characters horizontal tab, U+0009, and space, U+0020, are considered whitespace.
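
As a rough illustration of the format rules above, here is a sketch of a parser in plain PHP. The names pres_parse(), pres_trim(), and pres_unescape() are hypothetical. Where the draft leaves details open, the sketch makes assumptions: the key is everything before the first "=", heredoc identifiers follow PHP's identifier rules, the termination line must match the identifier exactly, any of CR LF, CR, or LF ends a line, and heredoc bodies are left untrimmed.

```php
<?php
// Sketch only: these function names are hypothetical, not an existing API.

// Replaces the escape sequences of the draft; anything else is illegal.
function pres_unescape(string $raw): string
{
    $map = ['\\' => '\\', 'n' => "\n", 't' => "\t", ' ' => ' ', '#' => '#'];
    $out = '';
    $len = strlen($raw);
    for ($i = 0; $i < $len; $i++) {
        if ($raw[$i] !== '\\') {
            $out .= $raw[$i];
            continue;
        }
        $next = ($i + 1 < $len) ? $raw[$i + 1] : '';
        if (!isset($map[$next])) {
            throw new UnexpectedValueException('Illegal escape sequence');
        }
        $out .= $map[$next];
        $i++;                            // skip the escaped character
    }
    return $out;
}

// Trims tab and space, but keeps a trailing whitespace character that is
// escaped by a preceding backslash (as in "\ " at the end of a value).
function pres_trim(string $raw): string
{
    $left = ltrim($raw, " \t");
    $trimmed = rtrim($left, " \t");
    $slashes = strlen($trimmed) - strlen(rtrim($trimmed, '\\'));
    if ($slashes % 2 === 1 && strlen($trimmed) < strlen($left)) {
        $trimmed .= $left[strlen($trimmed)];
    }
    return $trimmed;
}

function pres_parse(string $text): array
{
    if (strncmp($text, "\xEF\xBB\xBF", 3) === 0) {
        $text = substr($text, 3);        // drop the optional BOM
    }
    $result = [];
    $key = null;                         // key of the open heredoc, if any
    $end = '';                           // its terminating identifier
    $body = [];
    foreach (preg_split('/\r\n|\r|\n/', $text) as $line) {
        $bare = ltrim($line, " \t");
        if ($bare !== '' && $bare[0] === '#') {
            continue;                    // comment line, also inside heredocs
        }
        if ($key !== null) {
            if ($line === $end || $line === $end . ';') {
                $result[$key] = pres_unescape(implode("\n", $body));
                $key = null;
            } else {
                $body[] = $line;
            }
            continue;
        }
        if ($bare === '') {
            continue;                    // whitespace-only line
        }
        $pos = strpos($line, '=');
        $name = ($pos === false) ? '' : trim(substr($line, 0, $pos), " \t");
        if ($name === '') {
            throw new UnexpectedValueException("Illegal line: $line");
        }
        $value = pres_trim(substr($line, $pos + 1));
        if (preg_match('/^<<<[ \t]*([A-Za-z_][A-Za-z0-9_]*)$/', $value, $m)) {
            $key = $name;                // heredoc form: collect body lines
            $end = $m[1];
            $body = [];
        } elseif (strncmp($value, '<<<', 3) === 0) {
            throw new UnexpectedValueException("Illegal heredoc start: $line");
        } else {
            $result[$name] = pres_unescape($value);
        }
    }
    if ($key !== null) {
        throw new UnexpectedValueException("Unterminated heredoc for key $key");
    }
    return $result;
}

// Usage: parse a one-line resource file held in a string.
$demo = pres_parse("hello = Hello, world!\n");
echo $demo['hello'], "\n";
```

The sketch scans bytes rather than characters, which is safe for UTF-8 input because the bytes of multi-byte sequences never collide with "\", "#", "=", tab, or space. It does not validate that the input is well-formed UTF-8.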


With that, hello world becomes:

<?php
    $strings = intl_get_resources("strings");
    echo "$strings[hello]";
?>

The strings.pres file contains:
    hello = Hello, world!
and strings-ja.pres contains:
    hello = こんにちは、皆さん。

What do you think?

Regards,
Norbert

--
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
