[Stdcxx Wiki] Update of "LocaleLookup" by TravisVitek

Apache Wiki Wed, 26 Mar 2008 22:01:16 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Stdcxx Wiki" for change 
notification.


The following page has been changed by TravisVitek:
http://wiki.apache.org/stdcxx/LocaleLookup

------------------------------------------------------------------------------
  
  || Test || Criteria ||
  || 22.LOCALE.CODECVT.MT.CPP || *1,+ ||
- || 22.LOCALE.CODECVT.OUT.CPP || *2 ||
+ ||<rowstyle="color:green"> 22.LOCALE.CODECVT.OUT.CPP || *10 ||
  || 22.LOCALE.CONS.MT.CPP || *1,+ ||
  || 22.LOCALE.CTYPE.CPP || *2 ||
  || 22.LOCALE.CTYPE.IS.CPP || *2 ||
@@ -34, +34 @@

  || 22.LOCALE.MONEY.PUT.MT.CPP || *1,+ ||
  || 22.LOCALE.MONEYPUNCT.CPP || *4 ||
  || 22.LOCALE.MONEYPUNCT.MT.CPP || *1,+ ||
- || 22.LOCALE.NUM.GET.CPP || *9 ||
+ ||<rowstyle="color:red"> 22.LOCALE.NUM.GET.CPP || *9 ||
  || 22.LOCALE.NUM.GET.MT.CPP || *1,+ ||
- || 22.LOCALE.NUM.PUT.CPP || *9 ||
+ ||<rowstyle="color:red"> 22.LOCALE.NUM.PUT.CPP || *9 ||
  || 22.LOCALE.NUM.PUT.MT.CPP || *1,+ ||
  || 22.LOCALE.NUMPUNCT.MT.CPP || *1,+ ||
  || 22.LOCALE.STATICS.MT.CPP || *4,+ ||
- || 22.LOCALE.TIME.GET.CPP || *5,6 ||
+ ||<rowstyle="color:green"> 22.LOCALE.TIME.GET.CPP || *5,6 ||
  || 22.LOCALE.TIME.GET.MT.CPP || *1,+ ||
  || 22.LOCALE.TIME.PUT.MT.CPP || *1,+ ||
  
- * Any locale for which setlocale (LC_ALL, name) will succeed.
+  1. Any locale for which setlocale (LC_ALL, name) will succeed.
- * Any locale for which setlocale (LC_CTYPE, name) will succeed.
+  1. Any locale for which setlocale (LC_CTYPE, name) will succeed.
- * Any locale for which setlocale (LC_NUMERIC, name) will succeed.
+  1. Any locale for which setlocale (LC_NUMERIC, name) will succeed.
- * All installed locales.
+  1. All installed locales.
- * First locale matching a specific name.
+  1. First locale matching a specific name.
- * First locale matching a regular expression.
+  1. First locale matching a regular expression.
- * First locale that is not an alias for the C/POSIX locale.
+  1. First locale that is not an alias for the C/POSIX locale.
- * Any locale for which setlocale (LC_ALL, name) will succeed, list includes 
C/POSIX locale.
+  1. Any locale for which setlocale (LC_ALL, name) will succeed, list includes 
C/POSIX locale.
- * Any locale for which setlocale (LC_NUMERIC, name) will succeed and 
decimal_point is not '.'
+  1. Any locale for which setlocale (LC_NUMERIC, name) will succeed and 
decimal_point is not '.'
+  1. Locale with largest MB_CUR_MAX value.
  + Test limits the number of locales tested.
  
  ||<rowstyle="color:red">Note: Most of the MT tests limit the number of 
locales to 32, so the test failure is not a matter of running against too many 
locales, it is an issue of running too many iterations per thread. The 
'solution' discussed in this document doesn't seem to address the actual 
problem for these tests.||
+ ||<rowstyle="color:red">Note: Most of the tests simply run against all 
locales that have a specified category. We need to decide how to further reduce 
the number of locales tested.||
  
  [[Anchor(Definitions)]]
  = Definitions =
@@ -71, +73 @@

  
  This page relates to the issue described in 
[http://issues.apache.org/jira/browse/STDCXX-608 STDCXX-608]. There has been 
some discussion both on and off the dev@ list about how to proceed. This page 
is here to document what has been discussed.
  
- The plan to meet the [#Objective Objective] is to provide an interface to 
query the set of installed locales based on a set of a small number of 
essential parameters used by the localization tests. The interface should make 
it easy to express conjunction, disjunction, and negation of the terms 
(parameters) and support (a perhaps simplified version of) 
[http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_03
 Basic Regular Expression] syntax. We've decided to use shell brace expansion 
as a means of expressing logical conjunction between terms: a valid brace 
expression is expanded to obtain a set of terms implicitly connected by a 
logical AND. Individual ('\n'-separated) lines of the query string are taken to 
be implicitly connected by a logical OR. This approach models the 
[http://www.opengroup.org/onlinepubs/009695399/utilities/grep.html grep] 
interface with each line loosely corresponding to the argument of the `-e` 
option to `grep`.
+ The plan to meet the [#Objective Objective] is to provide an interface to 
query the set of installed locales based on a set of a small number of 
essential parameters used by the localization tests. The interface should make 
it easy to express conjunction, disjunction, and negation of the terms 
(parameters) and support (a perhaps simplified version of) 
[http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_03
 Basic Regular Expression] syntax. We've decided to use shell brace expansion 
as a means of expressing logical conjunction between terms: a valid brace 
expression is expanded to obtain a set of terms implicitly connected by a 
logical AND. Individual ('\n'-separated) lines of the query string are taken to 
be implicitly connected by a logical OR. This approach models the 
[http://www.opengroup.org/onlinepubs/009695399/utilities/grep.html grep] 
interface with each line loosely corresponding to the argument of the {{{-e}}} 
option to {{{grep}}}.
  
  [[Anchor(Part1)]]
  = Part 1 (STDCXX-714) =
  
- The first thing that we needed was to write the function for doing Basic 
Regular Expression name matching and add it to the test suite.. Martin has 
already added an implementation of 
[http://svn.apache.org/viewvc/stdcxx/trunk/tests/src/fnmatch.cpp rw_fnmatch](), 
so that is done. `rw_fnmatch()` is a simplified implementation of the POSIX 
[http://www.opengroup.org/onlinepubs/009695399/functions/fnmatch.html fnmatch] 
function which supports a simplified and modified form of BRE used in filename 
globbing. This is sufficient for what we need in term of regular expression 
support.
+ The first thing that we needed was to write the function for doing Basic 
Regular Expression name matching and add it to the test suite. Martin has 
already added an implementation of 
[http://svn.apache.org/viewvc/stdcxx/trunk/tests/src/fnmatch.cpp rw_fnmatch](), 
so that is done. {{{rw_fnmatch()}}} is a simplified implementation of the POSIX 
[http://www.opengroup.org/onlinepubs/009695399/functions/fnmatch.html 
fnmatch]() function which supports a simplified and modified form of BRE used 
in filename globbing. This is sufficient for what we need in terms of regular 
expression support.
  
  The second thing that we needed was a function to do brace expansion. After 
much discussion, it was decided that the csh brace expansion rules made the 
most sense. Travis provided an implementation of a function for doing brace 
expansion. The function 
[http://svn.apache.org/viewvc/stdcxx/trunk/tests/src/braceexp.cpp 
rw_shell_expand]() does whitespace tokenization and collapse, and then does 
brace expansion on each token, much like the behavior you would see from the 
csh shell.
  
  Just for illustration, consider the following string.
  
  {{{
-    a-{1,2}-b
+ a-{1,2}-b
  }}}
  
- If you passed this to `rw_shell_expand()` (with ' ' as the seperator), the 
result would be
+ If you passed this to {{{rw_shell_expand()}}} (with ' ' as the separator), 
the result would be
  
  {{{
-    a-1-b a-2-b
+ a-1-b a-2-b
  }}}
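
The tokenize-then-expand behavior described above can be sketched as follows. This is only an illustration, not the code in braceexp.cpp: the names {{{expand_token}}} and {{{shell_expand}}} are hypothetical, and the sketch assumes simplified rules (a token with an unmatched brace is passed through unchanged, and a group with a single alternative simply loses its braces).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Expand the first brace group found in `token`, then recurse on each
// result so that later and nested groups are expanded too.
static void expand_token (const std::string &token,
                          std::vector<std::string> &out)
{
    const std::size_t open = token.find ('{');
    if (open == std::string::npos) {
        out.push_back (token);          // nothing to expand
        return;
    }

    // locate the matching '}' and the commas at nesting depth 1
    int depth = 0;
    std::size_t close = std::string::npos;
    std::vector<std::size_t> cuts (1, open);
    for (std::size_t i = open; i < token.size (); ++i) {
        const char c = token [i];
        if (c == '{')
            ++depth;
        else if (c == '}') {
            if (--depth == 0) { close = i; break; }
        }
        else if (c == ',' && depth == 1)
            cuts.push_back (i);
    }

    if (close == std::string::npos) {
        out.push_back (token);          // unmatched brace: keep as is
        return;
    }
    assert (open < close);
    cuts.push_back (close);

    const std::string prefix = token.substr (0, open);
    const std::string suffix = token.substr (close + 1);

    // each alternative lies between two consecutive cut positions
    for (std::size_t k = 0; k + 1 < cuts.size (); ++k) {
        const std::string alt =
            token.substr (cuts [k] + 1, cuts [k + 1] - cuts [k] - 1);
        expand_token (prefix + alt + suffix, out);
    }
}

// Split `expr` on runs of whitespace, brace-expand every token, and
// join the results with `sep`.
static std::string shell_expand (const std::string &expr, char sep)
{
    std::vector<std::string> words;
    std::size_t pos = 0;
    while (pos < expr.size ()) {
        const std::size_t beg = expr.find_first_not_of (" \t\n", pos);
        if (beg == std::string::npos)
            break;
        std::size_t end = expr.find_first_of (" \t\n", beg);
        if (end == std::string::npos)
            end = expr.size ();
        expand_token (expr.substr (beg, end - beg), words);
        pos = end;
    }

    std::string result;
    for (std::size_t i = 0; i < words.size (); ++i) {
        if (i)
            result += sep;
        result += words [i];
    }
    return result;
}
```

With this sketch, {{{shell_expand ("a-{1,2}-b", ' ')}}} yields the same two words as the illustration above.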
  
  [[Anchor(Part2)]]
@@ -102, +104 @@

  The format of these files is simple. Here is a grammar
  
  {{{
-   native-name-list ::= <native-name> | <native-name> ',' <native-name-list> | 
'\n' <ws> <native-name-list>
+ native-name-list ::= <native-name> | <native-name> ',' <native-name-list> | 
'\n' <ws> <native-name-list>
-   line         ::= '#' <comment> | <canonical-name> <native-name-list>
+ line         ::= '#' <comment> | <canonical-name> <native-name-list>
-   line-list    ::= <line> | <line> '\n' <line-list> 
+ line-list    ::= <line> | <line> '\n' <line-list> 
  }}}
  
  The grammar is comma delimited, so the strings are not to be quoted. Here is 
an example to illustrate.
  
  {{{
-   # this is a comment line
+ # this is a comment line
  
-    # _not_ a comment line
+  # _not_ a comment line
-   # the above maps '_not_ a comment line' to the value '#'
+ # the above maps '_not_ a comment line' to the value '#'
  
-   # map 'English' to 'en'
+ # map 'English' to 'en'
-   en  English
+ en    English
  
-   # map 'Albanian', 'alb' and 'sqi' to 'sq'
+ # map 'Albanian', 'alb' and 'sqi' to 'sq'
-   sq    Albanian, alb, sqi
+ sq    Albanian, alb, sqi
  
-   # similar to above, except that mapping is multiline
+ # similar to above, except that mapping is multiline
-   cu    Church Slavic, Old Slavonic, Church Slavonic,
+ cu    Church Slavic, Old Slavonic, Church Slavonic,
-         Old Bulgarian, Old Church Slavonic, chu
+       Old Bulgarian, Old Church Slavonic, chu
  }}}
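
To make the file format concrete, here is a rough sketch of a reader for it. Nothing here comes from the test suite: the names {{{trim}}}, {{{add_names}}} and {{{parse_aliases}}} are made up for illustration, and the sketch assumes two things suggested by the example rather than stated by the grammar: a line is a comment only when '#' is its very first character, and a name list continues on the next line only when it ends in a comma.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

typedef std::map<std::string, std::string> alias_map;

// strip leading and trailing blanks
static std::string trim (const std::string &s)
{
    const std::size_t beg = s.find_first_not_of (" \t\r");
    if (beg == std::string::npos)
        return std::string ();
    const std::size_t end = s.find_last_not_of (" \t\r");
    return s.substr (beg, end - beg + 1);
}

// Map every comma-separated name in `list` to `canon`. Returns true
// when the list ends in a comma, i.e., continues on the next line.
static bool add_names (const std::string &list, const std::string &canon,
                       alias_map &table)
{
    assert (!canon.empty ());

    const std::string t = trim (list);
    const bool more = !t.empty () && t [t.size () - 1] == ',';

    std::istringstream in (t);
    std::string name;
    while (std::getline (in, name, ',')) {
        name = trim (name);
        if (!name.empty ())
            table [name] = canon;
    }
    return more;
}

// Parse the whole text of a mapping file into native-name -> canonical-name.
static alias_map parse_aliases (const std::string &text)
{
    alias_map table;
    std::istringstream in (text);
    std::string line, canon;
    bool continued = false;

    while (std::getline (in, line)) {
        if (continued) {
            // rest of a multiline name list
            continued = add_names (line, canon, table);
            continue;
        }
        if (line.empty () || line [0] == '#')
            continue;                   // blank line or comment

        const std::string t = trim (line);
        if (t.empty ())
            continue;

        // first word is the canonical name, the rest is the name list
        const std::size_t ws = t.find_first_of (" \t");
        canon = t.substr (0, ws);
        const std::string rest =
            ws == std::string::npos ? std::string () : t.substr (ws);
        continued = add_names (rest, canon, table);
    }
    return table;
}
```

Fed the example above, this sketch maps 'English' to 'en', 'sqi' to 'sq', 'chu' to 'cu', and (because of the leading space) '_not_ a comment line' to '#'.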
  
  [[Anchor(Part3)]]
@@ -132, +134 @@

  The proposed interface to all of this is a single public function named 
rw_query_locales(). The signature would be...
  
  {{{
-   char* rw_query_locales (int loc_cat, const char* query, size_t count);
+ char* rw_query_locales (int loc_cat, const char* query, size_t count);
  }}}
  
  The {{{loc_cat}}} parameter is the locale category to get locales for, just 
like `rw_locales()` does in its current implementation. The {{{query}}} 
parameter will be the query string. The {{{count}}} parameter is the maximum 
number of locales to return. This allows you to easily limit the number of 
locales returned and eventually tested.
@@ -140, +142 @@

  The proposed grammar used by the query string is similar to what is used for 
the xfail.txt {{{config}}} string. It is a shell globbed string that has its 
terms joined with dashes.
  
  {{{
-   <match> is a shell globbing pattern in the format below. All fields 
+ <match> is a shell globbing pattern in the format below. All fields are 
required. 
-   are required. 
  
-   iso-country  ::= ISO-639-1 or ISO-639-2 two or three character country code 
+ iso-language ::= ISO-639-1 or ISO-639-2 two or three character language code 
-   iso-language ::= ISO-3166 two character language code 
+ iso-country  ::= ISO-3166 two character country code 
-   iana-codeset ::= IANA codeset name with '-' replaced or removed 
+ iana-codeset ::= IANA codeset name
  
-   match        ::= <iso-language-expr> '-' <iso-country-expr> '-' 
<mb_cur_len-expr> '-' <iana-codeset-expr>
+ match        ::= <iso-language-expr> '-' <iso-country-expr> '-' 
<mb_cur_len-expr> '-' <iana-codeset-expr>
-   match_list   ::= match | match ' ' match_list 
+ match_list   ::= match | match ' ' match_list 
  }}}
  
  So, given a query string 
  
  {{{
-   *-{CA,US}-1-{ISO-8859-1,UTF-8}
+ *-{CA,US}-1-{ISO-8859-1,UTF-8}
  }}}
  
  this function would internally apply brace expansion to get the following 
list of expressions
  
  {{{
-   *-CA-1-*-ISO-8859-1 *-CA-1-*-UTF-8 *-US-1-*-ISO-8859-1 *-US-1-*-UTF-8
+ *-CA-1-ISO-8859-1 *-CA-1-UTF-8 *-US-1-ISO-8859-1 *-US-1-UTF-8
  }}}
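
Once the query has been expanded, each canonical locale name has to be tested against the resulting list of four-field patterns. A minimal sketch of that step follows. It uses the POSIX {{{fnmatch()}}} function as a stand-in for {{{rw_fnmatch()}}}, and the helper name {{{matches_any}}} is hypothetical.

```cpp
#include <cassert>
#include <fnmatch.h>    // POSIX fnmatch()
#include <sstream>
#include <string>

// Return true if the canonical name `canon` (for example
// "en-US-1-UTF-8") matches at least one of the space-separated
// glob patterns in `patterns`.
static bool matches_any (const std::string &patterns,
                         const std::string &canon)
{
    assert (!canon.empty ());

    std::istringstream in (patterns);
    std::string pat;
    while (in >> pat) {
        // fnmatch() returns 0 when the string matches the pattern
        if (0 == fnmatch (pat.c_str (), canon.c_str (), 0))
            return true;
    }
    return false;
}
```

Because the patterns in the list are alternatives, the first match is enough. Note that with these flags a {{{*}}} can also match dashes, so a pattern field is not strictly confined to one field of the canonical name.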
  
  ||<rowstyle="color:red"> /!\ Notice that I have moved the codeset to be the 
last match in the query string. That is because the codeset string is allowed 
to contain dashes. This was done to avoid issues with accidentally mistaking 
dashes in the codeset name with dashes in the grammar.||
@@ -169, +170 @@

  
  ||<rowstyle="color:red"> /!\ Perhaps we should consider adding an additional 
parameter to prepend the C/POSIX locales as there is no way to match them using 
the canonical locale name matching rules we've laid out above.||
  
- The buffer returned by `rw_locale_query()` is owned by that function and is 
not to be dallocated by the user. This buffer is currently planned to be left 
in use at program termination. If it is deemed necessary, some additional code 
can be written to cleanup the buffer before program exit, or we could require 
the user to deallocate the buffer when they are done with it.
+ The buffer returned by {{{rw_query_locales()}}} is owned by that function and 
is not to be deallocated by the user. This buffer is currently planned to be 
left in use at program termination. If it is deemed necessary, some additional 
code can be written to clean up the buffer before program exit, or we could 
require the user to deallocate the buffer when they are done with it.
+ 
+ [[Anchor(Ideas)]]
+ = Ideas =
+ 
+ I'm wondering why we didn't decide to use a callback system for this. It 
would allow us to use arbitrary criteria to test a locale. The interface 
wouldn't always be 'grep-like', but it would be very extensible. Something like 
this...
+ 
+ {{{
+ _TEST_EXPORT const char*
+ rw_locale_language (const char*);
+ 
+ _TEST_EXPORT const char*
+ rw_locale_territory (const char*);
+ 
+ _TEST_EXPORT const char*
+ rw_locale_codeset (const char*);
+ 
+ _TEST_EXPORT void
+ rw_locale_test (bool (*fun)(const char*, void*), void*);
+ }}}
+ 
+ The function {{{rw_locale_test()}}} would get a list of all installed 
locales, then pass the name of each locale and the user-supplied context 
pointer to {{{fun}}}. The user function could do whatever it wanted to decide 
if the locale is acceptable.
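
The driver side of such a callback interface might look roughly like the sketch below. It is not a proposed implementation of {{{rw_locale_test()}}}: the names {{{for_each_name}}}, {{{keep_longest}}} and {{{longest_name}}} are invented, the list of names is passed in rather than discovered, and the meaning of the callback's return value (true stops the walk) is an assumption, since it has not been decided here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

typedef bool (*locale_fun)(const char*, void*);

// Walk a list of NUL-separated names terminated by an empty string,
// passing each name and the context pointer to `fun`. Stops early
// when `fun` returns true (an assumption made for this sketch).
// Returns the number of names visited.
static std::size_t
for_each_name (const char *names, locale_fun fun, void *ctx)
{
    assert (names && fun);

    std::size_t visited = 0;
    for (const char *s = names; *s; s += std::strlen (s) + 1) {
        ++visited;
        if (fun (s, ctx))
            break;
    }
    return visited;
}

// sample context and callback: remember the longest name seen
struct longest_ctx
{
    const char *name;
    std::size_t len;
};

static bool
keep_longest (const char *name, void *p)
{
    longest_ctx* const ctx = static_cast<longest_ctx*>(p);
    const std::size_t len = std::strlen (name);
    if (ctx->len < len) {
        ctx->name = name;
        ctx->len  = len;
    }
    return false;   // keep going
}

// convenience wrapper showing how the context is set up and read back
static std::string
longest_name (const char *names)
{
    longest_ctx ctx = { "", 0 };
    for_each_name (names, keep_longest, &ctx);
    return ctx.name;
}
```

All selection state lives in the context object, so the driver needs no knowledge of the criterion being applied.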
+ 
+ This would make it quite simple to select only locales with a specific 
attribute. For example, if we only wanted to select a locale with the largest 
MB_CUR_MAX value...
+ 
+ {{{
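+ struct _locale_mb_context
+ {
+   char name [128];
+   int cur_len;
+ };
+ 
+ static bool
+ _rw_locale_mb_fun (const char* name, void* p)
+ {
+   // setlocale() returns a null pointer on failure, so only
+   // consider the locale when the call succeeds
+   const char* loc = setlocale (LC_CTYPE, name);
+   if (loc)
+   {
+     _locale_mb_context* context =
+         (_locale_mb_context*)p;
+ 
+     // MB_CUR_MAX is the maximum number of bytes in a multibyte
+     // character in the current locale
+     const int cur_len = (int)MB_CUR_MAX;
+     if (   context->cur_len < cur_len
+         && strlen (loc) < sizeof context->name)
+     {
+       // remember the name of the best locale seen so far
+       strcpy (context->name, loc);
+       context->cur_len = cur_len;
+     }
+   }
+ 
+   return false;
+ }
+ 
+ static void
+ test_big_mb_locale ()
+ {
+   // start with an empty name and a length smaller than any MB_CUR_MAX
+   _locale_mb_context ctxt = { "", 0 };
+   rw_locale_test (_rw_locale_mb_fun, &ctxt);
+ 
+   // run the test on locale named by ctxt.name
+ }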
+ }}}
+ 
+ Or, to get a list of all locales that match brace expansion
+ 
+ {{{
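+ static bool
+ _rw_locale_match (const char* name, void* p)
+ {
+   // p points at the expanded patterns; the loop below assumes they are
+   // NUL-separated and terminated by an empty string
+   const char* const expr = (const char*)p;
+ 
+   const char* language = rw_locale_language (name);
+   const char* country  = rw_locale_territory (name);
+   const char* codeset  = rw_locale_codeset (name);
+ 
+   // build the canonical name that the patterns are matched against
+   char buf [128];
+   sprintf (buf, "%s-%s-%s", language, country, codeset);
+ 
+   for (const char* s = expr;
+        *s; s += strlen (s) + 1)
+   {
+     // match the canonical name in buf, not the native name; this
+     // sketch treats a nonzero result as a match, so invert the test
+     // if rw_fnmatch() returns 0 on a match like POSIX fnmatch()
+     if (rw_fnmatch (s, buf))
+     {
+       // run the test on locale named by name
+     }
+   }
+ 
+   return false;
+ }
+ 
+ static void
+ test_all_matches (const char* expr)
+ {
+   char buf [256];
+ 
+   char* res = rw_shell_expand (expr, 0, buf, sizeof (buf));
+  
+   // rw_locale_test() is the function declared above
+   rw_locale_test (_rw_locale_match, res);
+ 
+   if (res != buf)
+     free (res);
+ }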
+ }}}
  
  [[Anchor(References)]]
  = References =
