Hi all,
I'm new to this list, so let me introduce myself: I'm one of the
internationalization architects at Yahoo, and have recently started
looking into PHP internationalization. I previously worked on Java
internationalization at Sun for a few years.
Resource bundles have helped make internationalization of Java
applications easy and popular, so I'd like to see a similar
capability in PHP. I know gettext is available, but it seems a bit
difficult to understand and uses locale-specific encodings instead of
Unicode.
The ICU style of resource bundles is Unicode-based, but also seems
more complicated than desirable for PHP. It's designed for a
statically typed environment and requires compilation, neither of
which fits well with PHP.
I'd rather start with Java properties files, the simplest and most
widely used form of Java resource files. I'd adapt them to PHP 6 by
switching their encoding to UTF-8, adopting heredocs, and simplifying
their syntax. I'd drop the secondary fallback mechanism, in which
resource bundles can inherit individual resources from other bundles.
This is an optimization to reduce the size of bundles at the expense
of runtime overhead and additional work in creating the bundles. The
additional step of finding common resources and moving them to shared
bundles is rarely made in normal localization processes, and the
space savings don't matter much for PHP, where bundles remain on the
server. Dropping the secondary fallback also means that we can
simplify the resource bundle to just an array.
Below is my draft specification for a resource bundle mechanism for
PHP. For comparison, the specs for the corresponding functionality in
Java and ICU4C are:
- ICU resource bundle specification:
http://icu.sourceforge.net/apiref/icu4c/ures_8h.html
- ICU resource file specification:
http://dev.icu-project.org/cgi-bin/viewcvs.cgi/icuhtml/design/bnf_rb.txt?view=co
- Java resource bundle specification:
http://java.sun.com/javase/6/docs/api/java/util/ResourceBundle.html
- Java properties file specification:
http://java.sun.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader)
API:
array intl_get_resources(string base_name)
Returns an array containing the key-value pairs obtained from a
resource file. The resource file is looked up for the current locale
using the lookup algorithm of section 3.4 of RFC 4647, at each step
generating the file name by concatenating the given base name, the
string "-", the language tag of the current step, and the string
".pres". The default file name used if no previous step was
successful is the concatenation of base name and ".pres". Once a file
is found, its contents are interpreted according to the resource file
format specified below, and an array is filled with its key-value
pairs. An entry with "#locale#" as its key and the actual locale tag
of the file found as its value is added to the array, and the array
is returned. The function may cache its results, but must check at
least once every 60 minutes that the underlying resource files
haven't changed.
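
The lookup steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not part of the draft API: the helper names pres_candidate_files() and pres_find_file() are hypothetical, the locale tag is passed in explicitly because the draft does not say how the current locale is obtained, and the code uses current PHP syntax rather than anything specific to PHP 6. Caching and the "#locale#" entry are left out.

```php
<?php
// Hypothetical helper (not part of the proposed API): lists the file names
// to try, in order, following the RFC 4647 section 3.4 lookup truncation.
function pres_candidate_files(string $baseName, string $localeTag): array
{
    $candidates = [];
    $subtags = $localeTag === '' ? [] : explode('-', $localeTag);
    while (count($subtags) > 0) {
        $candidates[] = $baseName . '-' . implode('-', $subtags) . '.pres';
        array_pop($subtags);
        // RFC 4647 3.4: a single-character subtag (such as "x") is removed
        // together with the subtag that followed it.
        while (count($subtags) > 0
            && strlen($subtags[count($subtags) - 1]) === 1) {
            array_pop($subtags);
        }
    }
    // Default file name, used if no previous step was successful.
    $candidates[] = $baseName . '.pres';
    return $candidates;
}

// Hypothetical helper: returns the first candidate that exists, or null.
function pres_find_file(string $baseName, string $localeTag): ?string
{
    foreach (pres_candidate_files($baseName, $localeTag) as $file) {
        if (is_file($file)) {
            return $file;
        }
    }
    return null;
}
```

For the base name "strings" and the locale "ja-JP", the candidates would be strings-ja-JP.pres, strings-ja.pres, and strings.pres, in that order.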
Resource File Format
- Files are encoded in UTF-8. The first line may be prefixed with a BOM.
- Lines whose first non-whitespace character is "#" are comment lines
and are ignored.
- Lines that contain only whitespace characters and are not part of a
heredoc string are ignored.
- Key-value definitions come in two forms:
  o The simple form has a key string, followed by "=", followed by the
    value, all on one line. The tokens may or may not be surrounded by
    whitespace characters. Leading and trailing whitespace is trimmed
    from both key and value. The value cannot start with "<<<"; for
    values starting with this character sequence, use the heredoc form.
  o The heredoc form starts with a key string, followed by "=",
    followed by "<<<", followed by an identifier, all on one line. The
    tokens may or may not be surrounded by whitespace characters.
    Leading and trailing whitespace is trimmed from both key and value.
    The heredoc form ends with a termination line that contains only
    the identifier, possibly followed by a semicolon. The lines between
    these two lines, except comment lines, form the heredoc string. The
    line break before the termination line is removed; all other line
    breaks are preserved.
- Lines that are not comment lines, whitespace lines, or part of a
key-value definition are illegal.
- The following escape sequences are recognized in values:
  o "\\" stands for "\".
  o "\n" stands for the newline character, U+000A.
  o "\t" stands for the horizontal tab character, U+0009.
  o "\ " stands for the space character, U+0020. This is only needed
    if the value of a key-value pair starts or ends with a space
    character.
  o "\#" stands for the number sign character, U+0023. This is only
    needed if a line within a heredoc string starts with this
    character.
- A sequence of "\" followed by a character not listed above is
illegal. A "\" immediately preceding the end of the file is illegal.
- Only the characters horizontal tab, U+0009, and space, U+0020, are
considered whitespace.
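
To make the format rules above more concrete, here is a rough parser sketch in userland PHP. It is one possible reading of the draft, not a reference implementation. The function names pres_unescape(), pres_trim_value(), and pres_parse() are hypothetical. The sketch also fills a few gaps with assumptions that the draft does not settle: CRLF and CR line endings are treated like LF, heredoc lines are kept verbatim without trimming, the termination line must match the identifier exactly, and a space escaped as "\ " at the end of a simple value survives trimming.

```php
<?php
// Hypothetical helper: decodes the escape sequences listed in the draft.
function pres_unescape(string $text): string
{
    $map = ['\\' => '\\', 'n' => "\n", 't' => "\t", ' ' => ' ', '#' => '#'];
    $out = '';
    $len = strlen($text);
    for ($i = 0; $i < $len; $i++) {
        if ($text[$i] !== '\\') {
            $out .= $text[$i];
            continue;
        }
        $next = $i + 1 < $len ? $text[$i + 1] : '';
        if (!isset($map[$next])) {
            throw new InvalidArgumentException('Illegal escape sequence');
        }
        $out .= $map[$next];
        $i++; // skip the escaped character
    }
    return $out;
}

// Hypothetical helper: trims whitespace but keeps a trailing escaped space.
function pres_trim_value(string $raw): string
{
    $value = ltrim($raw, " \t");
    $trimmed = rtrim($value, " \t");
    $backslashes = strlen($trimmed) - strlen(rtrim($trimmed, '\\'));
    if ($backslashes % 2 === 1 && strlen($trimmed) < strlen($value)) {
        $trimmed .= $value[strlen($trimmed)];
    }
    return $trimmed;
}

// Hypothetical parser for the contents of a .pres file.
function pres_parse(string $contents): array
{
    if (strncmp($contents, "\xEF\xBB\xBF", 3) === 0) {
        $contents = substr($contents, 3); // drop the optional BOM
    }
    // Assumption: CRLF and CR line endings are treated like LF.
    $lines = explode("\n", str_replace(["\r\n", "\r"], "\n", $contents));
    $count = count($lines);
    $result = [];
    for ($i = 0; $i < $count; $i++) {
        $stripped = ltrim($lines[$i], " \t");
        if ($stripped === '' || $stripped[0] === '#') {
            continue; // whitespace-only line or comment line
        }
        $eq = strpos($lines[$i], '=');
        if ($eq === false) {
            throw new InvalidArgumentException('Illegal line ' . ($i + 1));
        }
        $key = trim(substr($lines[$i], 0, $eq), " \t");
        $value = pres_trim_value(substr($lines[$i], $eq + 1));
        if (strncmp($value, '<<<', 3) === 0) {
            $id = ltrim(substr($value, 3), " \t");
            $body = [];
            for ($i++; $i < $count; $i++) {
                if ($lines[$i] === $id || $lines[$i] === $id . ';') {
                    break; // termination line
                }
                if (strncmp(ltrim($lines[$i], " \t"), '#', 1) !== 0) {
                    $body[] = $lines[$i]; // comment lines are skipped
                }
            }
            if ($i >= $count) {
                throw new InvalidArgumentException('Unterminated heredoc');
            }
            $value = implode("\n", $body);
        }
        $result[$key] = pres_unescape($value);
    }
    return $result;
}
```

Calling pres_parse() on the contents of the strings.pres file shown below would yield an array with the single key "hello". The byte-wise scan in pres_unescape() is safe for UTF-8 input because the backslash and all escaped characters are ASCII, and ASCII bytes never occur inside a multi-byte UTF-8 sequence.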
With that, hello world becomes:
<?php
$strings = intl_get_resources("strings");
echo "$strings[hello]";
?>
The strings.pres file contains:
hello = Hello, world!
and strings-ja.pres contains:
hello = こんにちは、皆さん。
What do you think?
Regards,
Norbert
--
PHP Unicode & I18N Mailing List (http://www.php.net/)