[issue37945] test_locale failing

Eryk Sun Thu, 29 Aug 2019 12:48:53 -0700


Eryk Sun <[email protected]> added the comment:


Here's some additional background information for work on this issue.

A Unix locale identifier has the following form:

    "language[_territory][.codeset][@modifier]"
        | "POSIX"
        | "C"
        | ""
        | NULL

(X/Open Portability Guide, Issue 4, 1992 -- aka XPG4)

Some systems also implement "C.UTF-8". 
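
As a rough illustration of the grammar above, here's a minimal Python sketch that splits a Unix locale identifier into its parts. The names parse_unix_locale and UnixLocale are made up for this example and aren't part of the locale module. It doesn't validate the ISO codes, and it simply treats "C" and "POSIX" as the language field.

```python
import re
from typing import NamedTuple, Optional


class UnixLocale(NamedTuple):
    """Hypothetical container for the parts of an XPG4-style locale name."""
    language: str
    territory: Optional[str]
    codeset: Optional[str]
    modifier: Optional[str]


# language[_territory][.codeset][@modifier]
_UNIX_LOCALE_RE = re.compile(
    r"""
    (?P<language>[^_.@]+)
    (?:_(?P<territory>[^.@]+))?
    (?:\.(?P<codeset>[^@]+))?
    (?:@(?P<modifier>.+))?
    """,
    re.VERBOSE,
)


def parse_unix_locale(name: str) -> UnixLocale:
    """Split a locale identifier into its fields without validating them."""
    match = _UNIX_LOCALE_RE.fullmatch(name)
    if match is None:
        # The empty string (and NULL) select a default locale; there is
        # nothing to parse in that case.
        raise ValueError(f"not a parsable locale identifier: {name!r}")
    return UnixLocale(**match.groupdict())
```

For example, parse_unix_locale("de_DE.UTF-8@euro") yields the fields "de", "DE", "UTF-8" and "euro".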

The language and territory should use ISO 639 and ISO 3166 alpha-2 codes. The 
"@" modifier may indicate an alternate script such as "sr_RS@latin" or an 
alternate currency such as "de_DE@euro". For the optional codeset, IANA 
publishes the following table of character sets:

http://www.iana.org/assignments/character-sets/character-sets.xhtml

In Debian Linux, the available encodings are defined by mapping files in 
"/usr/share/i18n/charmaps". But encodings can't be arbitrarily used in locales 
at run time. A locale has to be generated (see "/etc/locale.gen") before it's 
available. 
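
Since availability is a run-time property, a test can probe for a locale before relying on it. The following is only a sketch; is_locale_available is a hypothetical helper name. It assumes that a failed locale.setlocale call raises locale.Error, which is how the standard library reports it, and it restores the previous setting afterwards.

```python
import locale


def is_locale_available(name: str, category: int = locale.LC_CTYPE) -> bool:
    """Return True if setlocale accepts `name` for `category`."""
    # Calling setlocale without a locale argument only queries the setting.
    saved = locale.setlocale(category)
    try:
        locale.setlocale(category, name)
    except locale.Error:
        return False
    else:
        return True
    finally:
        # Always put the original locale back, whether or not the probe worked.
        locale.setlocale(category, saved)
```

Note that setlocale changes process-wide state, so this kind of probe isn't safe to run concurrently with other threads that depend on the locale.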

A Windows (not ucrt) locale name has the following form:

    "ISO639Language[-ISO15924Script][-ISO3166Region][SubTag][_SortOrder]"
        | ""                      | LOCALE_NAME_INVARIANT
        | "!x-sys-default-locale" | LOCALE_NAME_SYSTEM_DEFAULT
        | NULL                    | LOCALE_NAME_USER_DEFAULT

The invariant locale provides stable data. The system and user default locales 
vary according to the Control Panel "Region" settings.

A locale name is based on BCP 47 language tags, with the form 
"<language>-<script>-<region>" (e.g. "en-Latn-GB"), for which the script and 
region codes are optional. The language is an ISO 639 alpha-2 or alpha-3 code, 
with alpha-2 preferred. The script is an initial-uppercase ISO 15924 code. The 
region is an ISO 3166-1 alpha-2 or numeric-3 code, with alpha-2 preferred. 

As specified, the sort-order code should be delimited by an underscore, but 
Windows 10 (maybe older versions also?) accepts a hyphen instead. Here's a list 
of the sort-order codes that I've seen:

    * mathan - Math Alphanumerics       ( x-IV_mathan)
    * phoneb - Phone Book               (de-DE_phoneb)
    * modern - Modern                   (ka-GE_modern)
    * tradnl - Traditional              (es-ES_tradnl)
    * technl - Technical                (hu-HU_technl)
    * radstr - Radical/Stroke           (ja-JP_radstr)
    * stroke - Stroke Count             (zh-CN_stroke)
    * pronun - Pronunciation (Bopomofo) (zh-TW_pronun)
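
To show how the sort-order suffix could be separated from the rest of a locale name, here's a small Python sketch. The helper name split_sort_order is hypothetical, the table contains only the codes listed above (so it's probably incomplete), and the case-insensitive comparison is an assumption on my part. It accepts either underscore or hyphen as the delimiter.

```python
from typing import Optional, Tuple

# Only the sort-order codes listed above; other codes may exist.
SORT_ORDERS = {
    "mathan": "Math Alphanumerics",
    "phoneb": "Phone Book",
    "modern": "Modern",
    "tradnl": "Traditional",
    "technl": "Technical",
    "radstr": "Radical/Stroke",
    "stroke": "Stroke Count",
    "pronun": "Pronunciation (Bopomofo)",
}


def split_sort_order(name: str) -> Tuple[str, Optional[str]]:
    """Return (base_name, sort_order), with sort_order None if absent."""
    # Look at the last component, delimited by "_" or "-".
    index = max(name.rfind("_"), name.rfind("-"))
    if index != -1:
        tail = name[index + 1:].lower()  # assumption: codes are case-insensitive
        if tail in SORT_ORDERS:
            return name[:index], tail
    return name, None
```

With this, both "de-DE_phoneb" and "de-DE-phoneb" split into "de-DE" and "phoneb", while "en-GB" is returned unchanged with no sort order.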

One final note of interest about Windows locales is that the user-interface 
language has been functionally isolated from the locale. The display language 
is handled by the Multilingual User Interface (MUI) API, which depends on .mui 
files in locale-named subdirectories of a binary, such as "kernel32.dll" -> 
"en-US\kernel32.dll.mui". Windows 10 has an option to configure the user locale 
to match the preferred display language. This helps to keep the two in sync, 
but they're still functionally independent.

The Universal CRT (ucrt) in Windows supports the following syntax for a locale 
identifier:

    "ISO639Language[-ISO15924Script][-ISO3166Region][.utf8|.utf-8]"
        | "ISO639Language[-ISO15924Script][-ISO3166Region][SubTag][_SortOrder]"
        | "language[_region][.codepage|.utf8|.utf-8]"
        | ".codepage" | ".utf8" | ".utf-8"
        | "C"
        | ""
        | NULL

NULL is used with setlocale to query the current value of a category. The empty 
string "" is the current-user locale. "C" is a minimal locale. For LC_CTYPE, 
"C" uses Latin-1, but for LC_TIME it uses the system ANSI codepage (possibly 
multi-byte), which can lead to mojibake. The "POSIX" locale is not supported, 
nor is "C.UTF-8". 
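
From Python, the NULL query corresponds to calling locale.setlocale with None (or no second argument). Here's a minimal sketch that collects the current value of each category; query_locale_categories is a hypothetical name, and the getattr check is there because not every platform defines every category (e.g. LC_MESSAGES).

```python
import locale

CATEGORY_NAMES = (
    "LC_CTYPE",
    "LC_COLLATE",
    "LC_TIME",
    "LC_MONETARY",
    "LC_NUMERIC",
    "LC_MESSAGES",
)


def query_locale_categories() -> dict:
    """Map each available category name to its current locale string."""
    settings = {}
    for name in CATEGORY_NAMES:
        category = getattr(locale, name, None)
        if category is None:
            continue  # category not defined on this platform
        # Passing None queries the category without changing it.
        settings[name] = locale.setlocale(category, None)
    return settings
```

The returned strings are whatever the C runtime reports, so their format differs between platforms.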

Note that UTF-8 support is relatively new, as is the ability to set the 
encoding without also specifying a region (e.g. "english.utf8").

Recent versions of ucrt extend BCP-47 support in a couple of ways. Underscore 
is allowed in addition to hyphen as the tag delimiter (e.g. "en_GB" instead of 
"en-GB"), and specifying UTF-8 as the encoding (and only UTF-8) is supported. 
If UTF-8 isn't specified, internally the locale defaults to the language's ANSI 
codepage. ucrt has to parse BCP 47 locales manually if they include an 
encoding, and also in some cases when underscore is used. Currently this fails 
to handle a sort-order tag, so we can't use, for example, "de_DE_phoneb.utf8".
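
A test helper could screen out the combination that's described above as failing before passing a name to setlocale. This is only a sketch based on that description, not on ucrt's actual parser. The function name combines_sort_order_with_utf8 is hypothetical, and the set contains just the sort-order codes I've listed.

```python
# Only the sort-order codes I've seen; other codes may exist.
SORT_ORDER_CODES = frozenset({
    "mathan", "phoneb", "modern", "tradnl",
    "technl", "radstr", "stroke", "pronun",
})
UTF8_SUFFIXES = (".utf8", ".utf-8")


def combines_sort_order_with_utf8(name: str) -> bool:
    """Return True for names like "de_DE_phoneb.utf8"."""
    lowered = name.lower()
    for suffix in UTF8_SUFFIXES:
        if lowered.endswith(suffix):
            base = lowered[: -len(suffix)]
            break
    else:
        return False  # no UTF-8 encoding was requested
    # The sort order, if present, is the last "_" or "-" delimited component.
    index = max(base.rfind("_"), base.rfind("-"))
    return index != -1 and base[index + 1:] in SORT_ORDER_CODES
```

For "de_DE_phoneb.utf8" this returns True, and for "de-DE_phoneb" or "en_GB.utf8" it returns False.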

----------

_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue37945>
_______________________________________