[Stdcxx Wiki] Update of "LocaleLookup" by TravisVitek

Apache Wiki Tue, 11 Mar 2008 11:49:34 -0700

The following page has been changed by TravisVitek:
http://wiki.apache.org/stdcxx/LocaleLookup

------------------------------------------------------------------------------
  
  If we look up the canonical name {{{es_BO.ISO-8859-1}}} we will see three 
possible locale names. If we look through our list of installed locales, we 
will find {{{es_BO}}}, but it would be wrong to return that locale because it 
doesn't actually match on this particular platform.
  
- So one solution for this might be to get the codeset name and store it in the 
mapping. This assumes that it is safe to request a locale with a codeset even 
though the list of installed locales didn't specify the codeset.
+ Now we use the above data to figure out the canonical name from the local 
name, or vice versa.
+ 
+ {{{
+   es_BO.8859-15 maps to local name es_BO.ISO-8859-15
+   es_BO         maps to local name es_BO.ISO-8859-15 or es_BO.ISO-8859-1
+ }}}
+ 
+ How do we know which {{{es_BO}}} is right for this platform?
+ 
+ One possible direction here is to ask a locale for its codeset. Unfortunately, 
the returned string would still need to be mapped to a canonical string; for 
example, it might return {{{iso88591}}} on one platform and {{{ISO-8859-1}}} 
on another.
+ 
+ If we need to ask a locale for its codeset and then use an additional mapping 
to get the canonical codeset name, then why not just provide lookups for each 
component of the canonical locale name and look them up individually?
+ 
+ We would need at least three different mappings. We would need four if we 
wanted to map from a language code to a default territory code. This would be 
necessary so that we can map locale names like {{{russian}}} or {{{ru}}} to an 
appropriate territory code.
+ 
+ {{{
+   # codeset mappings [one to many]
+   ISO-8859-1    8859-1 ISO8859-1
+   ISO-8859-15   8859-15 ISO8859-15
+   1252          CP-1252 IBM-1252
+   1254          CP-1254 IBM-1254
+ 
+   # language mappings [one to many]
+   en    English
+   es    Spanish
+   ab    Abkhazian abk
+   sq    Albanian alb sqi
+ 
+   # territory mappings [one to many]
+   US    "United States"
+   DE    Germany
+ 
+   # default territory for language mappings [one to one]
+   ru RU
+   cs CZ
+ }}}
+ 
+ The advantage of this scheme over the previous scheme is that if we encounter 
a locale that we don't know, we might still be able to get a valid canonical 
name for it. With the previous scheme, if we can't find a mapping for the name, 
then we just use the original name as the canonical name. With per-component 
lookups, we would be able to build up a canonical name from whichever components 
we recognize, and that would increase the chances of being able to use the 
locale.
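
The per-component lookup described above could be sketched roughly as follows. This is only an illustration, not the stdcxx implementation: the table contents are seeded from the sample mappings on this page, the {{{russian}}} alias is an assumption added for the example, and the names {{{canonical_locale_name}}}, {{{canon}}} and {{{to_lower_ascii}}} are hypothetical. Locale modifiers such as {{{@euro}}} are not handled.

```cpp
#include <cctype>
#include <map>
#include <string>

// Hypothetical alias tables: lower-cased alias -> canonical spelling.
using AliasMap = std::map<std::string, std::string>;

std::string to_lower_ascii(std::string s) {
    for (char& c : s)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return s;
}

const AliasMap codesets = {
    {"iso-8859-1", "ISO-8859-1"},   {"8859-1", "ISO-8859-1"},
    {"iso8859-1", "ISO-8859-1"},    {"iso-8859-15", "ISO-8859-15"},
    {"8859-15", "ISO-8859-15"},     {"iso8859-15", "ISO-8859-15"},
    {"1252", "1252"}, {"cp-1252", "1252"}, {"ibm-1252", "1252"},
};
const AliasMap languages = {
    {"en", "en"}, {"english", "en"}, {"es", "es"}, {"spanish", "es"},
    {"ru", "ru"}, {"russian", "ru"},  // "russian" alias is assumed
};
const AliasMap territories = {
    {"us", "US"}, {"united states", "US"}, {"de", "DE"}, {"germany", "DE"},
};
// Default territory for a canonical language code [one to one].
const AliasMap default_territory = {{"ru", "RU"}, {"cs", "CZ"}};

// Look up one component; keep the original spelling when it is unknown.
std::string canon(const AliasMap& table, const std::string& part) {
    const AliasMap::const_iterator it = table.find(to_lower_ascii(part));
    return it == table.end() ? part : it->second;
}

// Split "language[_territory][.codeset]" and canonicalize each piece.
std::string canonical_locale_name(const std::string& local) {
    const std::string::size_type dot = local.find('.');
    const std::string head = local.substr(0, dot);
    const std::string codeset =
        dot == std::string::npos ? std::string() : local.substr(dot + 1);

    const std::string::size_type us = head.find('_');
    const std::string lang = canon(languages, head.substr(0, us));
    std::string terr = us == std::string::npos
        ? std::string() : canon(territories, head.substr(us + 1));

    if (terr.empty()) {  // fall back to the default territory, if any
        const AliasMap::const_iterator it = default_territory.find(lang);
        if (it != default_territory.end()) terr = it->second;
    }

    std::string result = lang;
    if (!terr.empty()) result += "_" + terr;
    if (!codeset.empty()) result += "." + canon(codesets, codeset);
    return result;
}
```

Because unknown components pass through unchanged, a name such as {{{es_BO.8859-15}}} still gets a canonical codeset even though {{{BO}}} is not in the territory table.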
  
  Another issue is that the data associated with each of the canonical locales, 
like {{{MB_CUR_LEN}}}, can differ between platforms. The {{{ar_DZ.UTF-8}}} 
locale uses a 6-byte codeset on Linux, but a 4-byte codeset on other platforms.
  
- I think the solution for this would be to not store the MB_CUR_LEN value in 
the file, but capture it and append it to the canonical locale name when we 
enumerate the installed locales.
+ I think the logical solution for this would be to not store the 
{{{MB_CUR_LEN}}} value in the file, but capture it and append it to the 
canonical locale name when we enumerate the installed locales. See notes in 
Part3 about {{{MB_CUR_LEN}}}.
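
Capturing the value at enumeration time might look something like the sketch below. It assumes that the {{{MB_CUR_LEN}}} value discussed on this page corresponds to what standard C exposes as the {{{MB_CUR_MAX}}} macro; the function name {{{tag_with_mb_cur_max}}} is hypothetical and not part of the proposed interface.

```cpp
#include <clocale>
#include <cstdlib>
#include <string>

// Returns "<canonical>.<n>", where n is MB_CUR_MAX as reported while the
// platform locale `local` is active. Returns an empty string if the
// platform refuses to install `local`. The caller's LC_ALL setting is
// restored before returning.
std::string tag_with_mb_cur_max(const std::string& local,
                                const std::string& canonical) {
    // setlocale() may reuse its return buffer, so copy the name first.
    const char* const current = std::setlocale(LC_ALL, nullptr);
    const std::string saved = current ? current : "C";

    if (!std::setlocale(LC_ALL, local.c_str()))
        return std::string();

    const std::size_t max_len = MB_CUR_MAX;  // depends on the active locale

    std::setlocale(LC_ALL, saved.c_str());
    return canonical + "." + std::to_string(max_len);
}
```

Since {{{setlocale}}} changes process-wide state, a helper like this would have to run before any threads that depend on the locale are started.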
  
  [[Anchor(Part3)]]
  = Part 3 (STDCXX-716) =
@@ -116, +153 @@

  The proposed interface to all of this is a single public function named 
rw_query_locales(). The signature would be...
  
  {{{
-   char* rw_query_locales(const char* query, size_t count);
+   char* rw_query_locales (const char* query, size_t count);
  }}}
  
  The {{{query}}} parameter will be the query string. The {{{count}}} parameter 
is the maximum number of locales to return. This allows you to easily limit the 
number of locales tested.
  
- The expected format of the query string is similar to what is described 
above, except that the requested MB_CUR_LEN value will be expected to be part 
of the query string. The accepted MB_CUR_LEN value would be separated from the 
canonical locale name expression with a period. An example query string...
+ The expected format of the query string is similar to what is described 
above, except that the requested {{{MB_CUR_LEN}}} value will be expected to be 
part of the query string. The accepted {{{MB_CUR_LEN}}} value would be 
separated from the canonical locale name expression with a period. An example 
query string...
  
  {{{
-    "zh_*.*.{5..3} *_FR.*.1"
+    zh_*.*.{5..3} *_FR.*.1
  }}}
  
  This would match all 5-, 4- and 3-byte encodings of the Chinese language in 
any country, in that order, then all 1-byte encodings for any language spoken 
in France.
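
To make the matching order concrete, here is a rough sketch of how such a query could be evaluated against a list of canonical names that already carry the trailing length. It is not the proposed {{{rw_query_locales()}}} implementation: it returns a vector instead of a {{{char*}}}, takes the installed list as a parameter, supports only {{{*}}} and a single {{{{a..b}}}} range per pattern, and all function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Minimal glob: '*' matches any run of characters (including none);
// every other character must match literally.
bool glob_match(const std::string& pat, const std::string& text,
                std::size_t p = 0, std::size_t t = 0) {
    if (p == pat.size()) return t == text.size();
    if (pat[p] == '*') {
        for (std::size_t k = t; k <= text.size(); ++k)
            if (glob_match(pat, text, p + 1, k)) return true;
        return false;
    }
    return t < text.size() && pat[p] == text[t]
        && glob_match(pat, text, p + 1, t + 1);
}

// Expand the first "{a..b}" into one pattern per integer, in the order
// written (so "{5..3}" yields 5, 4, 3). Patterns without a range are
// returned unchanged. Non-numeric bounds make std::stoi throw.
std::vector<std::string> expand_range(const std::string& pat) {
    const std::size_t npos = std::string::npos;
    const std::size_t open = pat.find('{');
    if (open == npos) return {pat};
    const std::size_t dots = pat.find("..", open);
    if (dots == npos) return {pat};
    const std::size_t close = pat.find('}', dots);
    if (close == npos) return {pat};

    const int from = std::stoi(pat.substr(open + 1, dots - open - 1));
    const int to = std::stoi(pat.substr(dots + 2, close - dots - 2));
    const int step = from <= to ? 1 : -1;

    std::vector<std::string> out;
    for (int n = from; ; n += step) {
        out.push_back(pat.substr(0, open) + std::to_string(n)
                      + pat.substr(close + 1));
        if (n == to) break;
    }
    return out;
}

// Evaluate whitespace-separated patterns left to right and collect at
// most `count` distinct matches, preserving pattern order.
std::vector<std::string> query_locales_sketch(
        const std::string& query,
        const std::vector<std::string>& installed,
        std::size_t count) {
    std::vector<std::string> result;
    std::istringstream words(query);
    std::string word;
    while (words >> word)
        for (const std::string& pat : expand_range(word))
            for (const std::string& name : installed) {
                if (result.size() == count) return result;
                if (glob_match(pat, name) &&
                    std::find(result.begin(), result.end(), name)
                        == result.end())
                    result.push_back(name);
            }
    return result;
}
```

Expanding the range before matching is what gives the "5, then 4, then 3" ordering: each expanded pattern is run over the whole installed list before the next one is tried.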
