Comments on:
Locale name guideline [Public Review Draft 2001-05-31]
http://www.li18nux.org/docs/text/locale-name-20010531.txt
1) The field CODESET should remain optional and in the case of its
absence, the implied value shall be "UTF-8".
Rationale:
- It can be hoped that within a few years UTF-8 will have replaced
most other encodings. It then becomes possible to provide most locales
(with perhaps a few exceptions) in a single encoding, namely UTF-8.
- Applications do not have to look at the locale name to determine the
encoding, because the C library function nl_langinfo(CODESET), as
specified in the Single UNIX Specification, already provides that
information:
<http://www.opengroup.org/onlinepubs/7908799/xsh/langinfo.h.html>
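To illustrate the last point, here is a minimal sketch in Python, whose
locale.nl_langinfo() wraps the C library call of the same name on Unix-like
systems. The helper name current_codeset is made up for this example.

```python
import locale


def current_codeset() -> str:
    """Return the character encoding of the active LC_CTYPE locale."""
    # Adopt the locale named by the environment (LC_ALL, LC_CTYPE, LANG).
    # If that locale is not installed, keep whatever is already active.
    try:
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        pass
    # Ask the C library directly instead of parsing the locale name.
    return locale.nl_langinfo(locale.CODESET)


if __name__ == "__main__":
    print(current_codeset())
```

The application never inspects the locale name itself, so an omitted
CODESET field costs it nothing.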
2) The field TERRITORY shall be optional. In the case of its absence,
generic values for the monetary formatting shall be used, namely
"¤" as the currency symbol, "XXX" as the ISO 4217 currency code,
and ISO 31-0 rules for formatting numbers. Alternatively, for languages
that are, as a first approximation, spoken only in a single country
(e.g. Japanese in Japan), the currency formatting rules of that country
can be used.
Rationale:
- All possible combinations of language and currency would lead to
a vast number of locales that would use a lot of disk space and
make GUI locale selection menus huge and user-unfriendly. ISO defines
around 150 currency codes (ISO 4217), around 240 country codes
(ISO 3166-1) and around 140 language codes (ISO 639-1).
- Unix applications and users hardly ever really use the monetary
formatting rules. Most financial applications have data tagged with
currency codes and use these with their own fixed locale-independent (!)
formatting rules. This is recommended practice for safety reasons.
- Most users are therefore very happy to just select a generic French,
English or Spanish locale, without any territory-specific information. The
availability of a generic English locale will significantly reduce
the number of users hassling support with questions like "I speak English
but live in Denmark. Where do I find a Danish English locale then?". Only
for very few languages do a few selected territories actually convey
significant additional information beyond the practically useless
currency information. The prime example is "en_US", which, compared
to "en", implies the use of North American paper formats, non-metric
units, 12h am/pm time notation and perhaps also US instead of
Commonwealth spelling.
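The defaults proposed in 1) and 2) could be applied when a locale name is
parsed, roughly as in the following sketch. It assumes the conventional
name syntax LANGUAGE[_TERRITORY][.CODESET][@MODIFIERS] and takes the generic
currency symbol to be U+00A4. The names LocaleName, parse_locale_name,
uses_generic_monetary and GENERIC_MONETARY are invented for this example
and are not part of the guideline.

```python
import re
from typing import NamedTuple, Optional

# Assumed name syntax: LANGUAGE[_TERRITORY][.CODESET][@MODIFIERS]
_LOCALE_NAME = re.compile(
    r"(?P<language>[a-z]{2,3})"
    r"(?:_(?P<territory>[A-Z]{2}))?"
    r"(?:\.(?P<codeset>[^@]+))?"
    r"(?:@(?P<modifiers>.+))?"
)

# Generic monetary values for names without TERRITORY (assumption: the
# generic currency symbol is U+00A4, the code is ISO 4217 "XXX").
GENERIC_MONETARY = {"currency_symbol": "\u00a4", "currency_code": "XXX"}


class LocaleName(NamedTuple):
    language: str
    territory: Optional[str]  # None means a generic, territory-free locale
    codeset: str              # never empty: an absent CODESET implies UTF-8
    modifiers: Optional[str]


def parse_locale_name(name: str) -> LocaleName:
    """Split a locale name into fields and fill in the implied defaults."""
    match = _LOCALE_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a locale name of the assumed form: {name!r}")
    return LocaleName(
        language=match["language"],
        territory=match["territory"],
        codeset=match["codeset"] or "UTF-8",
        modifiers=match["modifiers"],
    )


def uses_generic_monetary(loc: LocaleName) -> bool:
    """True if the generic monetary values apply (no TERRITORY given)."""
    return loc.territory is None
```

With this, parse_locale_name("en") yields a UTF-8 locale with no territory,
for which uses_generic_monetary() is true, while "en_US.ISO-8859-1" keeps
both its territory and its explicit codeset.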
3) Under "TERRITORY" remove the sentence
    "If an appropriate language name can not be represented by a 2 letter
    uppercase representation of ISO 3166-1, a 3 letter uppercase
    representation may be used."
ISO 3166-1 defines both 2-letter and 3-letter codes for ALL countries.
If there is no suitable ISO 3166-1 alpha-2 code, there will also be no
suitable ISO 3166-1 alpha-3 code. In any case, it should say "country
name" instead of "language name".
4) It would be nice if the CODESET and MIME Registry values were aligned.
For that purpose, replace "MS-932" with "WINDOWS-932".
5) Under "MODIFIERS" specify that the OPTION fields shall be alphabetically
sorted.
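A fixed order gives every set of options exactly one spelling, so two
locale names can be compared as plain strings. A minimal sketch, assuming
that the OPTION fields are separated by commas (the separator and the
function name canonical_modifiers are assumptions made for this example,
as is the sample input method name):

```python
def canonical_modifiers(modifiers: str, sep: str = ",") -> str:
    """Return the OPTION fields of a MODIFIERS string in sorted order."""
    # Drop empty fields such as those produced by a trailing separator.
    options = [opt for opt in modifiers.split(sep) if opt]
    # For ASCII option names of uniform case, code point order is
    # alphabetical order.
    return sep.join(sorted(options))


if __name__ == "__main__":
    print(canonical_modifiers("im=kinput2,euro"))
```

Both "im=kinput2,euro" and "euro,im=kinput2" then map to the same string.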
6) Under "MODIFIERS" write
"euro" for Euro currency (only for territories where EUR is planned
to be introduced but is not yet the only currency)
Rationale:
- This text makes clear that the "euro" field is only a temporarily
valid modifier for the euroland countries after 2001-03-01.
7) For "im=..." provide a list of input method names.
8) Add a list of references, including the full names of the latest editions
of the quoted standards.
9) Add a special locale "C.UTF-8", which has exactly the same semantics as
the "C" locale, except that the multi-byte encoding is guaranteed
to be UTF-8, not just some superset of C's portable character set
or ASCII. In particular, C.UTF-8 is *NOT* required to have case mappings
and collating information for all of UCS. C.UTF-8 is important for
preparing installations with very tight memory requirements, such as
embedded systems booting from EPROM or bootdisks that use a UTF-8
console. [Note: one day, when UTF-8 has become the almost exclusively
used encoding, "C.UTF-8" will be renamed to "C".]
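Whether a given system already offers such a locale can be probed as in
the following sketch. The helper name codeset_of is invented for this
example, and the result depends on which locales are installed, so no
particular output is promised.

```python
import locale
from typing import Optional


def codeset_of(locale_name: str) -> Optional[str]:
    """Return the codeset of locale_name, or None if it is unavailable."""
    # Calling setlocale() without a value only queries the current setting.
    saved = locale.setlocale(locale.LC_CTYPE)
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
    except locale.Error:
        return None
    try:
        return locale.nl_langinfo(locale.CODESET)
    finally:
        # Restore the LC_CTYPE setting that was active before the probe.
        locale.setlocale(locale.LC_CTYPE, saved)


if __name__ == "__main__":
    print(codeset_of("C.UTF-8"))
```

On a system that provides the locale proposed here, the call should report
UTF-8; elsewhere it returns None.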
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>