[Stdcxx Wiki] Update of "LocaleLookup" by TravisVitek

Apache Wiki Wed, 06 Feb 2008 13:06:55 -0800

The following page has been changed by TravisVitek:
http://wiki.apache.org/stdcxx/LocaleLookup

------------------------------------------------------------------------------
  [[Anchor(Definitions)]]
  = Definitions =
  
- canonical language code: The <language> field is two lowercase characters 
that represent the language as defined by [#References ISO-639].
+ '''canonical language code''': The {{{<language>}}} field is two lowercase 
characters that represent the language as defined by [#References ISO-639].
  
- canonical country code: The <COUNTRY> field is two uppercase letters that 
represent the country as defined by [#References ISO-3166].
+ '''canonical country code''': The {{{<COUNTRY>}}} field is two uppercase 
letters that represent the country as defined by [#References ISO-3166].
  
- canonical codeset code: The <CODESET> field is a string describing the 
encoding character set. For our purposes, the codeset is the preferred MIME 
name of the codeset as defined by [#References IANA].
+ '''canonical codeset code''': The {{{<CODESET>}}} field is a string 
describing the encoding character set. For our purposes, the codeset is the 
preferred MIME name of the codeset as defined by [#References IANA].
  
- canonical locale name: A complete locale name in the format 
<language>_<COUNTRY>.<CODESET>. Each field uses the canonical representation 
described above. [ex. en_US.ISO-8859-1]
+ '''canonical locale name''': A complete locale name in the format 
{{{<language>_<COUNTRY>.<CODESET>}}}. Each field uses the canonical 
representation described above. [ex. {{{en_US.ISO-8859-1}}}]
  
- native locale name: The locale name used by the local operating system. [ex. 
English_United States.1252, en]
+ '''native locale name''': The locale name used by the local operating system. 
[ex. {{{English_United States.1252}}}, {{{en}}}]
  
- local locale name: See native locale name.
+ '''local locale name''': See native locale name.
  
  [[Anchor(Plan)]]
  = Plan =
@@ -29, +29 @@

  
  Given a query string 
  
+ {{{
    {en,fr,*}_{CA,US,FR,CN}.*
+ }}}
  
  we would apply brace expansion to get the following list of expressions
  
+ {{{
    en_CA.*
    en_US.*
    en_FR.*
@@ -45, +48 @@

     *_US.*
     *_FR.*
     *_CN.*
+ }}}
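
The brace expansion step could be sketched roughly as follows. This is only an illustration: the function name `expand_braces` is hypothetical and not part of stdcxx, and it assumes brace groups are not nested.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Expand every non-nested {a,b,c} group in `pattern`, left to right.
// Hypothetical helper for illustration; not the stdcxx implementation.
std::vector<std::string> expand_braces(const std::string& pattern)
{
    const std::size_t open = pattern.find('{');
    if (open == std::string::npos)
        return std::vector<std::string>(1, pattern);   // nothing to expand

    const std::size_t close = pattern.find('}', open);
    if (close == std::string::npos)
        return std::vector<std::string>(1, pattern);   // unbalanced: keep as-is

    const std::string head = pattern.substr(0, open);
    const std::string body = pattern.substr(open + 1, close - open - 1);

    // Expand the rest of the pattern first, then prefix each alternative.
    const std::vector<std::string> tails =
        expand_braces(pattern.substr(close + 1));

    std::vector<std::string> result;
    std::size_t start = 0;
    for (;;) {
        const std::size_t comma = body.find(',', start);
        const std::string alt = body.substr(
            start, comma == std::string::npos ? comma : comma - start);

        for (std::size_t i = 0; i != tails.size(); ++i)
            result.push_back(head + alt + tails[i]);

        if (comma == std::string::npos)
            break;
        start = comma + 1;
    }

    assert(!result.empty());   // every pattern yields at least one expression
    return result;
}
```

Applied to the query string above, this yields the twelve expressions in the order listed, starting with `en_CA.*` and ending with `*_CN.*`.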
  
  Once we have this list of expressions, we would enumerate all of the 
installed locales and search them for locale names that match one of those 
shell-style patterns. The actual matching would be done using rw_fnmatch().
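
A minimal sketch of the search, assuming the list of installed locale names has already been collected. POSIX `fnmatch()` is used here only as a stand-in for `rw_fnmatch()`, whose exact signature is not shown on this page; `match_locales` is a hypothetical name.

```cpp
#include <cassert>     // assert() for quick self-checks
#include <cstddef>
#include <fnmatch.h>   // POSIX fnmatch(), standing in for rw_fnmatch()
#include <string>
#include <vector>

// Return every installed locale name that matches at least one pattern.
// Each name is reported once, in the order it appears in `installed`.
std::vector<std::string>
match_locales(const std::vector<std::string>& patterns,
              const std::vector<std::string>& installed)
{
    std::vector<std::string> matches;

    for (std::size_t i = 0; i != installed.size(); ++i) {
        for (std::size_t j = 0; j != patterns.size(); ++j) {
            // fnmatch() returns 0 when the string matches the pattern.
            if (0 == fnmatch(patterns[j].c_str(), installed[i].c_str(), 0)) {
                matches.push_back(installed[i]);
                break;   // one matching pattern is enough
            }
        }
    }
    return matches;
}
```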
  
- Every platform has a unique list of locales available. For example, Windows 
systems use 'English' as a language name, but most *nix systems use the 
canonical 'en' or in some cases 'EN'. This problem exists for the language, 
country and codeset fields of the locale name. To deal with this, we need to 
provide a mapping between the native names and the canonical names that we plan 
to use in the query string. It has been suggested that the mapping give a list 
of all known native locale names for each canonical locale name. The current 
suggestion is to provide one table with a list of all native locale names and 
the canonical names for all platforms. For efficiency, it was decided that this 
table include other information that may be useful such as MB_CUR_MAX for each 
of those locales.
+ Every platform has a unique list of locales available. For example, Windows 
systems use {{{English}}} as a language name, but most *nix systems use the 
canonical {{{en}}} or in some cases {{{EN}}}. This problem exists for the 
language, country and codeset fields of the locale name. To deal with this, we 
need to provide a mapping between the native names and the canonical names that 
we plan to use in the query string. It has been suggested that the mapping give 
a list of all known native locale names for each canonical locale name. The 
current suggestion is to provide one table with a list of all native locale 
names and the canonical names for all platforms. For efficiency, it was decided 
that this table include other information that may be useful such as 
{{{MB_CUR_MAX}}} for each of those locales.
  
  When we enumerate the list of installed locales we would use this data to map 
the locally installed locale name to the canonical locale name. For lookup 
purposes we use the canonical name, and once we've found a match, we provide 
the native locale name back to the user.
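
One possible shape for such a table and the lookup around it is sketched below. The struct, the function names and the three rows are illustrative assumptions, not the real data. POSIX `fnmatch()` again stands in for `rw_fnmatch()`.

```cpp
#include <cassert>     // assert() for quick self-checks
#include <cstddef>
#include <fnmatch.h>   // POSIX fnmatch(), standing in for rw_fnmatch()
#include <string>
#include <vector>

// One row per native locale name (hypothetical layout).
struct LocaleEntry {
    const char* native;      // name the operating system understands
    const char* canonical;   // <language>_<COUNTRY>.<CODESET>
    int         mb_cur_max;  // widest multibyte character, in bytes
};

// Illustrative rows only; real rows would be collected on each platform.
static const LocaleEntry locale_table[] = {
    { "English_United States.1252", "en_US.windows-1252", 1 },
    { "en_US.iso88591",             "en_US.ISO-8859-1",   1 },
    { "ar_DZ.utf8",                 "ar_DZ.UTF-8",        6 }
};

// Find the table row for a native locale name, or 0 if it is unknown.
const LocaleEntry* find_entry(const std::string& native)
{
    const std::size_t n = sizeof locale_table / sizeof *locale_table;
    for (std::size_t i = 0; i != n; ++i)
        if (native == locale_table[i].native)
            return &locale_table[i];
    return 0;
}

// Match `pattern` against canonical names, but return native names.
// Installed locales that are missing from the table are skipped.
std::vector<std::string>
lookup_native(const std::string&              pattern,
              const std::vector<std::string>& installed)
{
    std::vector<std::string> result;
    for (std::size_t i = 0; i != installed.size(); ++i) {
        const LocaleEntry* const entry = find_entry(installed[i]);
        if (entry && 0 == fnmatch(pattern.c_str(), entry->canonical, 0))
            result.push_back(entry->native);
    }
    return result;
}
```

With these rows, a query of `en_*.*` against a Windows-style installed list would hand back `English_United States.1252`, which is the name the caller can pass to the operating system.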
  
  [[Anchor(Issues)]]
  = Issues =
  
- Now that I'm collecting the list of installed locales to build up this table, 
I've noticed a few issues with the name mapping. One issue is that a single 
native locale name may map to a different canonical locale name on different 
platforms. For example, `es_BO' maps to `es_BO.ISO-8859-15' on AIX, but it maps 
to `es_BO.ISO-8859-1' on Linux and SunOS. Another issue is that the data 
associated with each of the canonical locales, like MB_CUR_MAX, is different on 
each platform. The ar_DZ.UTF-8 locale uses a 6-byte codeset on Linux, but a 
4-byte codeset on other platforms.
+ Now that I'm collecting the list of installed locales to build up this table, 
I've noticed a few issues with the name mapping. One issue is that a single 
native locale name may map to a different canonical locale name on different 
platforms. For example, {{{es_BO}}} maps to {{{es_BO.ISO-8859-15}}} on AIX, but 
it maps to {{{es_BO.ISO-8859-1}}} on Linux and SunOS. Another issue is that the 
data associated with each of the canonical locales, like {{{MB_CUR_MAX}}}, is 
different on each platform. The {{{ar_DZ.UTF-8}}} locale uses a 6-byte codeset 
on Linux, but a 4-byte codeset on other platforms.
  
  Options...
  
- I can provide one database per platform that includes all of the locale 
information for that platform. I could write a utility to create this file for 
each platform. I could even opt to use this file as the list of installed 
locales instead of checking the output of `locale -a'. The disadvantage is that 
the data would have to be verified or completed manually to handle mapping 
native locale names like 'czech' to a canonical name. Maybe we could skip 
these? If so, then maybe we could generate this file on the fly before running 
any tests.
+ I can provide one database per platform that includes all of the locale 
information for that platform. I could write a utility to create this file for 
each platform. I could even opt to use this file as the list of installed 
locales instead of checking the output of {{{locale -a}}}. The disadvantage is 
that the data would have to be verified or completed manually to handle mapping 
native locale names like {{{czech}}} to a canonical name. Maybe we could skip 
these? If so, then maybe we could generate this file on the fly before running 
any tests.
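
As a rough idea of what such a utility could do for each name, the sketch below activates a locale and reads the standard C macro `MB_CUR_MAX`, which reports the maximum number of bytes in a multibyte character for the current locale. The function name `mb_cur_max_for` is hypothetical, and a real generator would also need to record the canonical name.

```cpp
#include <cassert>   // assert() for quick self-checks
#include <clocale>   // std::setlocale
#include <cstdlib>   // MB_CUR_MAX
#include <string>

// Return MB_CUR_MAX for the named locale, or -1 if the platform cannot
// activate it. The previously active locale is restored before returning.
int mb_cur_max_for(const char* native_name)
{
    // Copy the current setting: the returned buffer may be overwritten
    // by the next call to setlocale().
    const char* const current = std::setlocale(LC_ALL, 0);
    const std::string saved(current ? current : "C");

    if (0 == std::setlocale(LC_ALL, native_name))
        return -1;   // locale not installed or name not recognized

    const int result = int(MB_CUR_MAX);

    std::setlocale(LC_ALL, saved.c_str());
    return result;
}
```

Running this over every name reported by the platform would give the per-platform values directly, so they would not need to be maintained by hand.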
  
- Another option would be to have a separate mapping for each of the locale 
name components. That makes it possible to map from 'English' to 'en' or from 
'iso88591' to 'ISO-8859-1' so I can build up the complete canonical locale name 
with each of the canonical locale name components. The disadvantage with this 
is that I may have trouble mapping from locale names like 'czech' to a single 
canonical name. Maybe I should skip these?
+ Another option would be to have a separate mapping for each of the locale 
name components. That makes it possible to map from {{{English}}} to {{{en}}} 
or from {{{iso88591}}} to {{{ISO-8859-1}}} so I can build up the complete 
canonical locale name with each of the canonical locale name components. The 
disadvantage with this is that I may have trouble mapping from locale names 
like {{{czech}}} to a single canonical name. Maybe I should skip these?
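
The per-component idea might look something like this. The tables hold only a few illustrative entries, all names (`NamePair`, `map_component`, `canonicalize`) are hypothetical, and names without both a country and a codeset, such as `czech`, are simply skipped by returning an empty string.

```cpp
#include <cassert>   // assert() for quick self-checks
#include <cstddef>
#include <string>

struct NamePair {
    const char* native;
    const char* canonical;
};

// Illustrative entries only; the real tables would be much longer.
static const NamePair languages[] = {
    { "English", "en" }, { "EN", "en" }, { "en", "en" }
};
static const NamePair countries[] = {
    { "United States", "US" }, { "US", "US" }
};
static const NamePair codesets[] = {
    { "iso88591", "ISO-8859-1" }, { "ISO-8859-1", "ISO-8859-1" },
    { "1252", "windows-1252" }
};

// Look up one component; an empty string means "no mapping known".
template <std::size_t N>
std::string map_component(const NamePair (&table)[N],
                          const std::string& native)
{
    for (std::size_t i = 0; i != N; ++i)
        if (native == table[i].native)
            return table[i].canonical;
    return std::string();
}

// Build <language>_<COUNTRY>.<CODESET> from a native locale name.
// Returns an empty string when any component is missing or unmapped.
std::string canonicalize(const std::string& native)
{
    const std::size_t under = native.find('_');
    const std::size_t dot   = native.rfind('.');
    if (under == std::string::npos || dot == std::string::npos || dot < under)
        return std::string();   // e.g. "czech": skipped

    const std::string lang =
        map_component(languages, native.substr(0, under));
    const std::string country =
        map_component(countries, native.substr(under + 1, dot - under - 1));
    const std::string codeset =
        map_component(codesets, native.substr(dot + 1));

    if (lang.empty() || country.empty() || codeset.empty())
        return std::string();

    return lang + '_' + country + '.' + codeset;
}
```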
  
  [[Anchor(References)]]
  = References =
