Comments on locale name guideline

Markus Kuhn Wed, 06 Jun 2001 12:39:06 -0700
Comments on:

  Locale name guideline [Public Review Draft 2001-05-31]
  http://www.li18nux.org/docs/text/locale-name-20010531.txt

1) The field CODESET should remain optional and in the case of its
   absence, the implied value shall be "UTF-8".

   Rationale:

     - It can be hoped that in a few years, UTF-8 will have replaced
       most other encodings, and then it becomes possible to have (with perhaps
       a few exceptions) most locales in the same encoding, namely UTF-8.

     - Applications do not have to look at the locale name to determine the
       encoding, as the standard C library function nl_langinfo(CODESET) provides
       that information already:
       <http://www.opengroup.org/onlinepubs/7908799/xsh/langinfo.h.html>
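
   A minimal Python sketch of both points, not part of the draft:
   implied_codeset() is a hypothetical helper that applies the proposed
   UTF-8 default when a locale name carries no CODESET field, and
   runtime_codeset() asks the C library through Python's
   locale.nl_langinfo(locale.CODESET) wrapper (available on Unix only).
   It assumes the usual language[_TERRITORY][.CODESET][@MODIFIERS] layout
   and does not treat "C" or "POSIX" specially.

```python
import locale


def implied_codeset(locale_name: str, default: str = "UTF-8") -> str:
    """Return the CODESET field of a locale name, or the default if absent.

    Hypothetical helper; the name and the layout assumption are ours.
    """
    # Strip any "@modifiers" suffix, then look for a ".CODESET" part.
    base = locale_name.split("@", 1)[0]
    _, dot, codeset = base.partition(".")
    return codeset if dot and codeset else default


def runtime_codeset() -> str:
    """Ask the C library which codeset the current LC_CTYPE locale uses."""
    try:
        # Adopt the locale selected by the environment (LANG, LC_* ...).
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale that is not installed;
        # keep whatever locale is already active.
        pass
    return locale.nl_langinfo(locale.CODESET)
```

   An application would normally rely on runtime_codeset() alone; the
   name-based helper only illustrates what the proposed default would mean
   for a name such as "de_DE".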

2) The field TERRITORY shall be optional and in the case of its absence
   generic values for the monetary formatting shall be used, namely
   "¤" as the currency symbol, "XXX" as the ISO 4217 currency code,
   and ISO 31-0 rules for formatting numbers. Alternatively, for languages
   that are, as a first approximation, spoken only in a single country
   (e.g. Japanese in Japan), the currency formatting rules of that country
   can be used.

   Rationale:

     - All possible combinations of language and currency would lead to
       a vast number of locales that would use a lot of disk space and
       make GUI locale selection menus huge and user-unfriendly. ISO has
       around 150 ISO 4217 currency codes, around 240 ISO 3166-1 country codes
       and around 140 ISO 639-1 language codes.

     - Unix applications and users hardly ever really use the monetary
       formatting rules. Most financial applications have data tagged with
       currency codes and use these with their own fixed locale-independent (!)
       formatting rules. This is recommended practice for safety reasons.

     - Most users therefore are very happy to just select a generic French,
       English or Spanish locale, without any territory-specific information. The
       availability of a generic English locale will significantly reduce
       the number of users hassling support about "I speak English but live in
       Denmark. Where do I find a Danish English locale then?". Only for very
       few languages do a few selected territories actually convey significant
       additional information beyond the practically useless currency
       information. The prime example is "en_US", which, compared to "en",
       implies the use of North American paper formats, non-metric units,
       12h am/pm time notation and perhaps also US instead of Commonwealth
       spelling.
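
   The fallback proposed in item 2 could be sketched roughly as follows.
   This is an illustration only: monetary_source(), the dictionary keys and
   the single_country_languages table are hypothetical names, and the
   sketch assumes the language[_TERRITORY][.CODESET][@MODIFIERS] layout.

```python
# Generic values proposed for locales without a TERRITORY field:
# U+00A4 is the generic currency sign, "XXX" the ISO 4217 code for
# "no currency". The key names are illustrative, not from any standard.
GENERIC_MONETARY = {"symbol": "\u00a4", "iso4217_code": "XXX"}


def monetary_source(locale_name, single_country_languages=None):
    """Decide where monetary formatting for a locale name should come from.

    Returns ("territory", code) when a territory is given or implied,
    otherwise ("generic", GENERIC_MONETARY).
    """
    # Hypothetical table for languages spoken essentially in one country,
    # e.g. {"ja": "JP"}.
    table = single_country_languages or {}

    # Drop "@modifiers" and ".CODESET", then split language from territory.
    base = locale_name.split("@", 1)[0].split(".", 1)[0]
    language, sep, territory = base.partition("_")

    if sep and territory:
        return ("territory", territory)
    if language in table:
        return ("territory", table[language])
    return ("generic", GENERIC_MONETARY)
```

   With this sketch, "en" maps to the generic values, "fr_CA.UTF-8" to the
   rules of CA, and "ja" to those of JP if the table contains that entry.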

3) Under "TERRITORY" remove the sentence

     If an appropriate language name can not be represented by a 2 letter
     uppercase representation of ISO 3166-1, a 3 letter uppercase
     representation may be used.

   ISO 3166-1 defines both 2-letter and 3-letter codes for ALL countries.
   If there is no suitable ISO 3166-1 alpha-2 code, there will also be no
   suitable ISO 3166-1 alpha-3 code. In any case, it should say "country
   name" instead of "language name".

4) It would be nice if the CODESET and MIME Registry values were aligned.
   For that purpose, replace "MS-932" with "WINDOWS-932".

5) Under "MODIFIERS" specify that the OPTION fields shall be alphabetically
   sorted.
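
   One way to enforce such an ordering is to canonicalize names before
   comparing them. The sketch below is only an assumption about how this
   could be done: canonicalize_modifiers() is a hypothetical helper, and the
   "," separator between OPTION fields is assumed here rather than taken
   from the draft. Python's sorted() orders by code point, which matches
   alphabetical order for lowercase ASCII option names.

```python
def canonicalize_modifiers(locale_name: str, sep: str = ",") -> str:
    """Return the locale name with its "@" OPTION fields sorted.

    The separator between OPTION fields is an assumption (default ",").
    """
    base, at, modifiers = locale_name.partition("@")
    if not at:
        # No modifier part at all; nothing to reorder.
        return locale_name
    # Ignore empty fields such as those produced by a trailing separator.
    options = sorted(opt for opt in modifiers.split(sep) if opt)
    return base + "@" + sep.join(options) if options else base
```

   Two names that differ only in the order of their OPTION fields then
   compare equal after canonicalization.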

6) Under "MODIFIERS" write

     "euro"      for Euro currency (only for territories where EUR is planned
                 to be introduced but is not yet the only currency)

   Rationale:

     - This text makes clear that the "euro" field is only a temporarily
       valid modifier for the euroland countries after 2001-03-01.

7) For "im=..." provide a list of input method names.

8) Add a list of references, including the full names of the latest editions
   of the quoted standards.

9) Add a special locale "C.UTF-8", which has exactly the same semantics as
   the "C" locale, except that the multi-byte encoding is guaranteed
   to be UTF-8, not just some superset of C's portable character set
   or ASCII. In particular, C.UTF-8 is *NOT* required to have case mappings
   and collating information for all of UCS. C.UTF-8 is important for
   preparing installations with very tight memory requirements, such as
   embedded systems booting from EPROM or bootdisks that use a UTF-8
   console. [Note: one day, when UTF-8 has become the almost exclusively
   used encoding, "C.UTF-8" will be renamed into "C".]

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/