[Stdcxx Wiki] Update of "LocaleLookup" by TravisVitek

Apache Wiki Mon, 10 Mar 2008 15:01:10 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Stdcxx Wiki" for change 
notification.


The following page has been changed by TravisVitek:
http://wiki.apache.org/stdcxx/LocaleLookup

------------------------------------------------------------------------------
- [[Anchor(Definitions)]]
- 
  = Problem Statement =
  
  Modern operating systems provide support for dozens or even hundreds of locales 
encoded in various codesets. The set of locales and codesets installed on a 
computer is typically determined by the system administrator at the time the 
operating system is installed. Although there are standards and conventions in 
place to establish a common set of locale names, due to historical reasons both 
locale and codeset names tend to vary from one implementation to another. 
Operating systems may provide the standard names as well as the traditional 
ones, with the former simply being aliases for the latter.
@@ -12, +10 @@

  
  The objective of this project is to provide an interface that makes it easy to 
write localization tests that require no knowledge of platform-specific details, 
that provide sufficient code coverage, and that complete in a reasonable amount 
of time (ideally seconds as opposed to minutes). The interface must make it 
easy to query the system for locales that satisfy the specific requirements of 
each test. For example, most tests that currently use all installed locales 
(e.g., the set of tests for the `std::ctype` facet) only need to exercise a 
representative sample of the installed locales without using the same locale 
more than once. Thus the interface will need to make it possible to specify 
such a sample. Another example is tests that attempt to exercise locales in 
multibyte encodings whose `MB_CUR_MAX` ranges from 1 to 6 (some of the 
`std::codecvt` facet tests). The new interface will need to make it easy to 
specify such a set of locales without explicitly naming them, and it will 
 need to retrieve such locales without returning duplicates.
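
As a rough illustration of the `MB_CUR_MAX` requirement above, the following sketch picks at most one locale for each distinct `MB_CUR_MAX` value from a list of candidate names. The function name `one_locale_per_mb_cur_max` and the idea of passing the candidates in as a vector are assumptions made for this example only; they are not part of stdcxx or of the proposed interface.

```cpp
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of stdcxx): returns one locale name for
// each distinct MB_CUR_MAX value found among the candidate names.
std::map<int, std::string>
one_locale_per_mb_cur_max (const std::vector<std::string> &candidates)
{
    std::map<int, std::string> result;

    // remember the current locale so that it can be restored afterwards
    const char* const cur = std::setlocale (LC_ALL, 0);
    const std::string saved (cur ? cur : "C");

    for (std::size_t i = 0; i != candidates.size (); ++i) {
        // skip names that setlocale() rejects with a null pointer
        if (!std::setlocale (LC_ALL, candidates [i].c_str ()))
            continue;

        const int mb_max = int (MB_CUR_MAX);

        // insert() keeps the first locale seen for each MB_CUR_MAX value
        result.insert (std::make_pair (mb_max, candidates [i]));
    }

    std::setlocale (LC_ALL, saved.c_str ());
    return result;
}
```

A test that needs encodings of every width from 1 to 6 could then iterate over the returned map instead of over every installed locale, which is where most of the time savings would presumably come from.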
  
+ [[Anchor(Definitions)]]
  = Definitions =
  
  '''canonical language code''': The {{{<language>}}} field is two lowercase 
characters that represent the language as defined by [#References ISO-639].
@@ -24, +23 @@

  
  '''native locale name''': The locale name used by the local operating system. 
[ex. {{{English_United States.1252}}}, {{{en}}}]
  
- '''locale locale name''': See native locale name.
+ '''local locale name''': See native locale name.
  
  [[Anchor(Plan)]]
  = Plan =
  
  This page relates to the issue described at 
http://issues.apache.org/jira/browse/STDCXX-608. There has been some discussion 
both on and off the dev@ list about how to proceed. This page is here to 
document what has been discussed.
  
- The idea behind the issue is to create some mechanism for querying the list 
of installed locales, selecting those that match given criteria.
- 
- [[Anchor(2008_01_28)]]
- = Discussion 2008/01/28 =
- 
- The idea is to take a regular expression like query string, do a brace 
expansion to get several simpler regular expressions, and then search the list 
of installed locales for matches.
+ The plan is to take a regular-expression-like query string, perform brace 
expansion on it to get several simpler regular expressions, and then search the 
list of installed locales for matches.
  
  Given a query string 
  
@@ -63, +57 @@

  
  Once we have this list of expressions, we would enumerate all of the 
installed locales, and then search through them looking for locale names that 
match one of those regular expressions. The actual matching would be done using 
rw_fnmatch().
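
The two steps described above (brace expansion, then matching against the installed locale names) might look roughly like the sketch below. It makes several assumptions: POSIX `fnmatch()` stands in for `rw_fnmatch()`, whose exact signature is not shown on this page; the helper names `brace_expand` and `find_locales` are hypothetical; and the list of installed locales is simply passed in rather than enumerated from the system.

```cpp
#include <fnmatch.h>   // POSIX fnmatch(), a stand-in for rw_fnmatch()
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: expands the first {a,b,...} group of pat and
// recurses on each result, so later and nested groups are expanded too.
void brace_expand (const std::string &pat, std::vector<std::string> &out)
{
    const std::string::size_type open = pat.find ('{');
    if (open == std::string::npos) { out.push_back (pat); return; }

    // find the matching close brace, tracking the nesting depth
    int depth = 0;
    std::string::size_type close = open;
    for (; close != pat.size (); ++close) {
        if (pat [close] == '{') ++depth;
        else if (pat [close] == '}' && --depth == 0) break;
    }
    // unbalanced braces: treat the whole pattern literally
    if (close == pat.size ()) { out.push_back (pat); return; }

    const std::string head = pat.substr (0, open);
    const std::string tail = pat.substr (close + 1);

    // split the group body on commas that are not inside a nested group
    std::string::size_type start = open + 1;
    depth = 0;
    for (std::string::size_type i = open + 1; i <= close; ++i) {
        if (i < close && pat [i] == '{') ++depth;
        else if (i < close && pat [i] == '}') --depth;
        else if (i == close || (pat [i] == ',' && depth == 0)) {
            brace_expand (head + pat.substr (start, i - start) + tail, out);
            start = i + 1;
        }
    }
}

// Hypothetical helper: returns the entries of installed that match at
// least one expanded pattern; an entry is reported once even when
// several patterns match it.
std::vector<std::string>
find_locales (const std::string &query,
              const std::vector<std::string> &installed)
{
    std::vector<std::string> patterns;
    brace_expand (query, patterns);

    std::vector<std::string> matches;
    for (std::size_t i = 0; i != installed.size (); ++i) {
        for (std::size_t j = 0; j != patterns.size (); ++j) {
            if (0 == fnmatch (patterns [j].c_str (),
                              installed [i].c_str (), 0)) {
                matches.push_back (installed [i]);
                break;
            }
        }
    }
    return matches;
}
```

With this sketch, a made-up query such as `{en,de}_*.UTF-8` expands to `en_*.UTF-8` and `de_*.UTF-8`, and each installed name is then tested against both patterns in turn.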
  
- Every platform has a unique list of locales available. For example, Windows 
systems use {{{English}}} as a language name, but most *nix systems use the 
canonical {{{en}}} or in some cases {{{EN}}}. This problem exists for the 
language, country and codeset fields of the locale name. To deal with this, we 
need to provide a mapping between the native names and the canonical names that 
we plan to use in the query string. It has been suggested that the mapping give 
a list of all known native locale names for each canonical locale name. The 
current suggestion is to provide one table with a list of all native locale 
names and the canonical names for all platforms. For efficiency, it was decided 
that this table include other information that may be useful such as 
{{{MB_CUR_MAX}}} for each of those locales.
- 
- When we enumerate the list of installed locales we would use this data to map 
the locally installed locale name to the canonical locale name. For lookup 
purposes we use the canonical name, and once we've found a match, we provide 
the native locale name back to the user.
- 
- [[Anchor(Issues)]]
- = Issues =
- 
- Now that I'm collecting the list of installed locales to build up this table, 
I've noticed a few issues with the name mapping. One issue is that a single 
native locale name may map to a different canonical locale name on different 
platforms. For example, {{{es_BO}}} maps to {{{es_BO.ISO-8859-15}}} on AIX, but 
it maps to {{{es_BO.ISO-8859-1}}} on Linux and SunOS. Another issue is that the 
data associated with each of the canonical locales, like {{{MB_CUR_MAX}}}, is 
different on each platform. The {{{ar_DZ.UTF-8}}} locale uses a 6 byte codeset 
on Linux, but a 4 byte codeset on other platforms.
- 
- Options...
- 
- I can provide one database per-platform that includes all of the locale 
information for that platform. I could write a utility to create this file for 
each platform. I could even opt to use this file as the list of installed 
locales instead of checking the output of {{{locale -a}}}. The disadvantage is 
that the data would have to be verified or completed manually to handle mapping 
native locales names like {{{czech}}} to a canonical name. Maybe we could skip 
these? If so, then maybe we could generate this file on the fly before running 
any tests.
- 
- Another option would be to have a separate mapping for each of the locale 
name components. That makes it possible to map from {{{English}}} to {{{en}}} or 
from {{{iso88591}}} to {{{ISO-8859-1}}} so I can build up the complete 
canonical locale name with each of the canonical locale name components. The 
disadvantage with this is that I may have trouble mapping from locales names 
like {{{czech}}} to a single canonical name. Maybe I should skip these?
  
  [[Anchor(References)]]
  = References =
