Comments on internationalization API

Norbert Lindenberg Tue, 19 Jul 2011 01:29:39 -0700

Hi all,

I'm sorry for not having been able to contribute to the internationalization 
API earlier. I finally have reviewed the straw man [1], and am pleased to see 
that it contains a good subset of internationalization functionality to start 
with. Number and date formatting and collation are issues that most 
applications have to deal with. Collation especially, but also date formatting 
with support for multiple time zones and calendars are hard to implement as 
downloadable libraries.


I have some comments on the details though:

1. In the background section, it might be useful to add that with Node.js, 
server-side JavaScript is seeing a rebound, and applications don't really want 
to have to call out to a non-JavaScript server in order to handle basic 
internationalization.

2. In the goals section, I'd qualify the "reuse of objects" goal as a reuse of 
implementation data structures, or even better replace it with measurable 
performance goals. Reuse of objects that are visible to applications has 
security and privacy implications, especially when loading third party code 
(apps or ads) onto pages [2]. I'd recommend letting applications freely 
construct Collator, NumberFormat, and DateTimeFormat objects, but have these 
objects share implementation objects (such as ICU objects) as much as possible. 
If the API does return shared objects, the security issues need to be dealt 
with, e.g., by specifying that the shared objects are immutable.
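
A minimal sketch of what point 2 could look like. All names here 
(createCollator, getCollatorImpl, implCache) are made up for illustration and 
do not appear in the straw man, and the comparison is only a placeholder for a 
real collation engine.

```javascript
// Hypothetical sketch: none of these names come from the straw man.
// Expensive implementation data is cached per locale, while every caller
// receives its own frozen wrapper object.
const implCache = new Map();

function getCollatorImpl(locale) {
  // Stand-in for an expensive engine object (for example an ICU collator).
  if (!implCache.has(locale)) {
    implCache.set(locale, {
      locale,
      // Placeholder ordering by code unit, not real collation.
      compare: (a, b) => (a < b ? -1 : a > b ? 1 : 0),
    });
  }
  return implCache.get(locale);
}

function createCollator(locale) {
  const impl = getCollatorImpl(locale); // shared, never exposed directly
  // A fresh, frozen object per call: whatever third-party code does with
  // its own collator is not visible to other code on the page.
  return Object.freeze({
    compare: (a, b) => impl.compare(a, b),
  });
}

const first = createCollator("de");
const second = createCollator("de");
```

Here first and second are distinct objects, yet only one implementation 
object is created for "de".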

3. I'm very uncomfortable with the LocaleInfo class. It seems to pretend to be 
the central source of all locale-related information, but can't live up to that 
claim because its design is limited to number and date formatting and 
collation. Developers will need to create other functionality such as text 
segmentation, spelling checking, message lookup, shoe size conversion, etc. 
LocaleInfo appears to perform some magic to derive regions, currencies, and 
possibly time zones, but doesn't specify it, and makes none of it available to 
other internationalization classes. It also does duty as a namespace, which 
looks odd in an EcmaScript standard that otherwise doesn't know namespaces.

Other internationalization libraries have a core that anybody can build on to 
create internationalization functionality. In Java, for example, the Locale and 
Currency classes handle a variety of identifier mappings, while the 
ResourceBundle class handles loading of localized data with fallbacks [3]. In 
the Yahoo User Interface library, the Intl module does language negotiation and 
collaborates with the YUI loader in loading localized data [4]. I'd suggest 
separating similar functionality in LocaleInfo from the formatting and 
collation functionality and making it available to all. I suspect though that 
some of the current magic will turn out to be misguided when looked at in the 
clear light of a specification and will need to be discarded.

4. Language IDs in the library should be those of BCP 47, not of Unicode LDML. 
The two are similar, but there are subtle differences, as described in the LDML 
spec: LDML excludes some BCP 47 tags and subtags, adds a separator and the root 
locale, and changes the semantics of some tags [5]. Since BCP 47 is the 
dominant standard for language identification, internationalized applications 
have to support it. If an implementation of the internationalization API is 
based on LDML, it should handle the mapping from/to BCP 47 itself rather than 
burdening applications with it.

5. The specification mentions that a few Unicode extensions in BCP 47 (-u-ca-, 
-u-co-) can be used for specific purposes, but is silent on whether other 
extensions are encouraged/allowed/ignored/illegal. This should be clarified.

6. Region IDs should be those of ISO 3166. The straw man references "LDML 
region subtags" instead; I haven't been able to find a definition of this term. 
If "ZZ" is really necessary for the API, then it should be called out directly 
in the API spec. But what information does "ZZ" convey that EcmaScript's 
"undefined" doesn't?

7. The priority list matching algorithm is not well specified. It doesn't seem 
to match the BCP 47 Lookup algorithm [6], however, and I'd expect that algorithm 
to be available at least as a baseline (enhancements might be offered as well).
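
For reference, the Lookup algorithm of RFC 4647 section 3.4 is small enough 
to sketch. The function name lookup and its parameters are chosen here for 
illustration only; the sketch handles basic language ranges and the plain "*" 
range, not wildcards inside a range.

```javascript
// Sketch of RFC 4647 section 3.4 Lookup for basic language ranges.
// `lookup`, `priorityList`, `available`, and `defaultTag` are illustrative names.
function lookup(priorityList, available, defaultTag) {
  // Tags compare case-insensitively; keep the original spelling for the result.
  const byLowerCase = new Map(available.map((tag) => [tag.toLowerCase(), tag]));
  for (const range of priorityList) {
    if (range === "*") continue; // the wildcard range is skipped in Lookup
    const subtags = range.toLowerCase().split("-");
    while (subtags.length > 0) {
      const candidate = subtags.join("-");
      if (byLowerCase.has(candidate)) return byLowerCase.get(candidate);
      subtags.pop(); // truncate from the end
      // A single-character subtag (extension or private-use introducer)
      // is removed together with the subtag that followed it.
      while (subtags.length > 0 && subtags[subtags.length - 1].length === 1) {
        subtags.pop();
      }
    }
  }
  return defaultTag;
}

const chosen = lookup(["de-CH-u-co-phonebk", "fr"], ["en", "fr", "de"], "en");
```

With these inputs the first range is truncated step by step until it reaches 
"de", so "fr" is never consulted.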

8. The specifications of NumberFormat and DateTimeFormat list several optional 
features: Support for scientific notation in NumberFormat; support for various 
styles and skeletons in DateTimeFormat. How can applications find out which of 
these optional features are supported by an actual implementation?

9. Currency formatting should require applications to explicitly specify the 
currency, using an ISO 4217 currency code, when constructing a currency number 
format. Currencies are really part of the value; they're not a presentation 
preference. Imagine a European e-commerce site calculating its prices in euro, 
but then displaying the values with the Korean won symbol just because the user 
configured his browser to send "Accept-Language: de-DE-u-cu-KRW" or 
"Accept-Language: de-KR"... [7].
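
A rough sketch of a constructor-style function that enforces this. The name 
createCurrencyFormat and its signature are hypothetical, and the formatting 
itself is a placeholder rather than locale-aware output.

```javascript
// Hypothetical sketch: `createCurrencyFormat` is not part of the straw man.
function createCurrencyFormat(languageTag, currencyCode) {
  // ISO 4217 alphabetic codes are three letters. This checks only the
  // shape, not whether the code is actually assigned.
  if (typeof currencyCode !== "string" || !/^[A-Za-z]{3}$/.test(currencyCode)) {
    throw new TypeError("A currency format requires an explicit ISO 4217 code");
  }
  const currency = currencyCode.toUpperCase();
  return {
    languageTag, // the language governs presentation only, never the currency
    currency,
    // Placeholder output; a real implementation would use locale data for
    // grouping, decimal separator, and symbol placement.
    format: (amount) => `${currency} ${amount.toFixed(2)}`,
  };
}

const euroPrices = createCurrencyFormat("de-KR", "EUR");
```

The currency stays EUR no matter which region or extension the language tag 
carries, and omitting the code is an error rather than a guess.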

10. Are the limits described for the NumberFormat parameters defaults or hard 
limits? It doesn't seem to make sense to impose hard limits such as "max 3 
fraction digits, min 0".

11. The description of the DateTimeFormat constructors refers to 
"LocaleInfo.prototype.numberFormat".

12. DateTimeFormat needs to provide a way for applications to specify the time 
zone, identified by a tz database identifier [8]. Browser-side code may need 
this capability to enforce a site-dependent time zone (e.g., a US financial 
site has to display quotes in New York City time), while server-side code may 
have to use the user's time zone. While it's possible to encode the time zone 
as part of a language ID (e.g., "en-AU-u-tz-auldh" to add Australia/Lord_Howe 
to Australian English), languages and time zones are really orthogonal concepts 
that should be kept separate, and the tz database identifiers are the most 
widely used identifiers for time zones.

13. DateTimeFormat also needs to let applications specify whether and how to 
include a time zone display name in the output. In CLDR, that's typically tied 
to the time style - long and full have the time zone, while short and medium 
don't. In reality, applications need to indicate the time zone to users if (and 
only if) it's not obvious from the context, and that's orthogonal to whether 
they want seconds.

14. There are a few additional DateTimeFormat skeletons that I think would be 
commonly used in applications:
- MMMdEEE, MMMMdEEEE: month, day, weekday in either abbreviated or full width; 
intended for dates in the current year.
- jmm: hour and minute, in 12-hour or 24-hour format as appropriate for the 
locale.
- jjjmmm: hour and minute, and if necessary am/pm, but with the appropriate 
characters for hour and minute rather than a colon in languages where that's 
commonly used, such as Chinese/Japanese/Korean: 오후 11시 5분. Falls back to jmm in 
other languages.
- z, zzzz: time zone names.
Other notes:
- yyyyMMMMd, "era only if necessary": should explain what that means, e.g., 
"era only for those calendars that need eras in order to uniquely identify all 
years after 1900".
- It must be possible to combine skeletons for date, time, and time zone (at 
most one each).

15. It seems that the correct handling of missing dateStyle or timeStyle 
parameters would be to omit the date or time from the formatted output.

16. DateTimeFormat.prototype.getAmPm is described as "array of eras". Beyond 
that typo, is this function really useful, given that many locales don't have 
am/pm strings, and LDML has deprecated the corresponding element?

17. Error handling needs to be specified in detail. I assume this will be done 
once the functionality is settled, so I won't go into much detail now. However, 
contrary to the current statement "invalid language ids or non-string elements 
should be ignored" (in priority lists), I think the library should throw errors 
for erroneous input. Language tags should at least be String objects and 
well-formed according to BCP 47 [9]. Similarly, an exception should be thrown 
if some value other than a Date object is passed into 
DateTimeFormat.prototype.format. Note that exceptions in EcmaScript do not 
oblige the direct caller to use try/catch - they're like unchecked exceptions 
in Java.
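
A sketch of the kind of checks meant here. The names checkPriorityList, 
checkDate, and TAG_SHAPE are hypothetical. The regular expression is a 
deliberately simplified shape test (hyphen-separated subtags of 1 to 8 ASCII 
letters or digits), which is necessary but not sufficient for BCP 47 
well-formedness, and the choice of TypeError and RangeError is an assumption.

```javascript
// Simplified shape check, NOT the full BCP 47 grammar: a first subtag of
// 1-8 letters, then any number of subtags of 1-8 letters or digits.
const TAG_SHAPE = /^[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*$/;

// Hypothetical helper: rejects bad entries instead of silently ignoring them.
function checkPriorityList(priorityList) {
  for (const tag of priorityList) {
    if (typeof tag !== "string") {
      throw new TypeError("Language tag must be a string: " + String(tag));
    }
    if (!TAG_SHAPE.test(tag)) {
      throw new RangeError("Malformed language tag: " + tag);
    }
  }
  return priorityList;
}

// Hypothetical helper for a format function that accepts only Date objects.
function checkDate(value) {
  if (!(value instanceof Date)) {
    throw new TypeError("Expected a Date object");
  }
  return value;
}
```

Callers that pass correct input need no try/catch; the errors surface only 
when something is actually wrong.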

18. I know there has been a proposal for and discussion of MessageFormat 
functionality - is there a record of why it got removed from the strawman?


References:

[1] http://wiki.ecmascript.org/doku.php?id=strawman:i18n_api, version 
2011-07-01.
[2] http://code.google.com/p/google-caja/wiki/GlobalObjectPoisoning
[3] http://download.oracle.com/javase/6/docs/technotes/guides/intl/overview.html#locale
[4] http://developer.yahoo.com/yui/3/intl/
[5] http://unicode.org/reports/tr35/#Unicode_Language_and_Locale_Identifiers
[6] http://tools.ietf.org/html/rfc4647#section-3.4
[7] http://finance.yahoo.com/currency-converter/?amt=1&from=EUR&to=KRW
[8] http://www.twinsun.com/tz/tz-link.htm
[9] http://tools.ietf.org/html/rfc5646#section-2.2.9

Best regards,
Norbert

