Eryk Sun added the comment:
locale.normalize is generally wrong on Windows. It's meant for POSIX systems.
Currently "tr_TR" is parsed as follows:
>>> locale._parse_localename('tr_TR')
('tr_TR', 'ISO8859-9')
The encoding "ISO8859-9" is meaningless to Windows. Also, the old CRT only ever
supported either full language/country names or non-standard abbreviations --
e.g. either "Turkish_Turkey" or "trk_TUR". Having locale.getdefaultlocale()
return ISO two-letter codes (e.g. "en_GB") was fundamentally wrong for the old
CRT. (2.7 will die with this wart.)
3.5+ uses the Universal CRT, which does support standard ISO codes, but only in
BCP 47 [1] locale names of the following form:
    language        ISO 639
    ["-" script]    ISO 15924
    ["-" region]    ISO 3166-1
BCP 47 locale names have been preferred by Windows for the past 13 years, since
Vista was released. Windows extends BCP 47 with a non-standard sort-order field
(e.g. "de-Latn-DE_phoneb" is the German language with Latin script in the
region of Germany with phone-book sort order). Another departure from strict
BCP 47 in Windows is allowing underscore to be used as the delimiter instead of
hyphen.
In a concession to existing C code, the Universal CRT also supports an encoding
suffix in BCP 47 locales, but this can only be either ".utf-8" or ".utf8".
(Windows itself does not support specifying an encoding in a locale name, but
it's Unicode anyway.) No other encoding is allowed. If ".utf-8" isn't
specified, a BCP 47 locale defaults to the locale's ANSI codepage. However,
there's no way to convey this in the locale name itself. Also, if a locale is
Unicode only (e.g. Hindi), the CRT implicitly uses UTF-8 even without the
".utf-8" suffix.
The following are valid BCP 47 locale names in the CRT: "tr", "tr.utf-8",
"tr-TR", "tr_TR", "tr_TR.utf8", or "tr-Latn-TR.utf-8". But note that
"tr_TR.1254" is not supported.
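As a rough illustration of the name forms listed above, the following is a minimal sketch of a checker. The function name parse_bcp47 is hypothetical and not part of the locale module. The field lengths (2-3 letters for the language, 4 for the script, 2 letters or 3 digits for the region) are assumptions taken from the referenced ISO standards, and the Windows sort-order field is deliberately not handled.

```python
import re

# Sketch of the BCP 47 subset described above: language, optional script,
# optional region, and an optional ".utf-8" or ".utf8" suffix. Either "-" or
# "_" is accepted as the delimiter. The Windows sort-order field (as in
# "de-Latn-DE_phoneb") is not handled here.
_BCP47_RE = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"                # ISO 639
    r"(?:[-_](?P<script>[A-Za-z]{4}))?"           # ISO 15924
    r"(?:[-_](?P<region>[A-Za-z]{2}|[0-9]{3}))?"  # ISO 3166-1
    r"(?:\.(?P<encoding>utf-?8))?"                # only ".utf-8" or ".utf8"
)


def parse_bcp47(name):
    """Return a dict of the matched fields, or None if the name does not fit."""
    match = _BCP47_RE.fullmatch(name)
    return match.groupdict() if match else None
```

With this sketch, parse_bcp47('tr_TR.1254') returns None, as do the legacy names 'trk_TUR' and 'Turkish_Turkey', so a caller could use it to decide whether a name should be treated as BCP 47 at all.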
The following shows that omitting the optional "utf-8" encoding in a BCP 47
locale makes the CRT default to the associated ANSI codepage.
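>>> import ctypes, locale
>>> ucrt = ctypes.CDLL('ucrtbase')  # assumed setup: the Universal CRT DLL
>>> locale.setlocale(locale.LC_CTYPE, 'tr_TR')
'tr_TR'
>>> ucrt.___lc_codepage_func()
1254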
The C function ___lc_codepage_func() returns the codepage of the current
locale. The same codepage can be queried directly for a BCP 47 locale via
GetLocaleInfoEx:
>>> cpstr = (ctypes.c_wchar * 6)()
>>> kernel32.GetLocaleInfoEx('tr-TR',
... LOCALE_IDEFAULTANSICODEPAGE, cpstr, len(cpstr))
5
>>> cpstr.value
'1254'
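The excerpt above assumes that kernel32 and LOCALE_IDEFAULTANSICODEPAGE are already defined. A self-contained sketch of the same query might look like the following. The function name query_ansi_codepage is hypothetical, the constant value 0x1004 is taken from the Windows SDK headers, and the platform guard is an addition so that the module can at least be imported elsewhere.

```python
import ctypes
import sys

LOCALE_IDEFAULTANSICODEPAGE = 0x1004  # value from the Windows SDK headers


def query_ansi_codepage(locale_name):
    """Return the default ANSI codepage of a BCP 47 locale as an int.

    A result of 0 indicates a Unicode-only locale.
    """
    if sys.platform != "win32":
        raise OSError("GetLocaleInfoEx is only available on Windows")
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetLocaleInfoEx.argtypes = (
        ctypes.c_wchar_p,  # lpLocaleName
        ctypes.c_uint32,   # LCType
        ctypes.c_wchar_p,  # lpLCData (output buffer)
        ctypes.c_int,      # cchData (buffer size in characters)
    )
    kernel32.GetLocaleInfoEx.restype = ctypes.c_int
    cpstr = (ctypes.c_wchar * 6)()
    # GetLocaleInfoEx returns 0 on failure, otherwise the number of
    # characters written, including the terminating null.
    if not kernel32.GetLocaleInfoEx(
            locale_name, LOCALE_IDEFAULTANSICODEPAGE, cpstr, len(cpstr)):
        raise ctypes.WinError(ctypes.get_last_error())
    return int(cpstr.value)
```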
If the result is '0', it's a Unicode-only locale (e.g. 'hi-IN' -- Hindi,
India). Recent versions of the CRT use UTF-8 (codepage 65001) for Unicode-only
locales:
>>> locale.setlocale(locale.LC_CTYPE, 'hi-IN')
'hi-IN'
>>> ucrt.___lc_codepage_func()
65001
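Taken together, the defaulting behavior described so far could be summarized in a small pure function. This is only a sketch of the rules as stated in this comment, not of the CRT's actual implementation. The name effective_codepage and its parameters are hypothetical, and the Unicode-only fallback applies to recent CRT versions only.

```python
CP_UTF8 = 65001


def effective_codepage(encoding_suffix, ansi_codepage):
    """Guess the codepage the CRT would use for a BCP 47 locale.

    encoding_suffix: 'utf-8', 'utf8', or None if the name has no suffix.
    ansi_codepage: the locale's default ANSI codepage, 0 if Unicode-only.
    """
    if encoding_suffix in ("utf-8", "utf8"):
        # An explicit UTF-8 suffix always wins.
        return CP_UTF8
    if ansi_codepage == 0:
        # Unicode-only locale (e.g. hi-IN): recent CRTs fall back to UTF-8.
        return CP_UTF8
    # Otherwise the locale's ANSI codepage is used.
    return ansi_codepage
```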
Here are some example locale tuples that should be supported, given that the
CRT continues to support full English locale names and non-standard
abbreviations, in addition to the new BCP 47 names:
('tr', None)
('tr_TR', None)
('tr_Latn_TR', None)
('tr_TR', 'utf-8')
('trk_TUR', '1254')
('Turkish_Turkey', '1254')
The return value from C setlocale can be normalized to replace hyphen
delimiters with underscores, and "utf8" can be normalized as "utf-8". If it's a
BCP 47 locale that has no encoding, GetLocaleInfoEx can be called to query the
ANSI codepage. UTF-8 can be assumed if it's a Unicode-only locale.
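A minimal sketch of the string normalization suggested here follows. The function name normalize_setlocale_result is hypothetical. It covers only the hyphen and "utf8" rewrites, leaves the GetLocaleInfoEx lookup to the caller, and does not try to tell a BCP 47 delimiter from a hyphen that is part of a full English locale name.

```python
def normalize_setlocale_result(name):
    """Split a C setlocale() result into a (language, encoding) tuple.

    Hyphen delimiters become underscores and "utf8" becomes "utf-8".
    The encoding is None when the name carries no suffix.
    """
    # Split on the last dot so that only a trailing suffix is treated
    # as the encoding.
    head, dot, tail = name.rpartition(".")
    if dot:
        language, encoding = head, tail
    else:
        language, encoding = name, None
    language = language.replace("-", "_")
    if encoding is not None and encoding.lower() == "utf8":
        encoding = "utf-8"
    return language, encoding
```

When the returned encoding is None for a BCP 47 name, the caller would still need to query the ANSI codepage and assume UTF-8 for a Unicode-only locale.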
As to prefixing a codepage with 'cp', we don't really need to do this. We have
aliases defined for most, such as '1252' -> 'cp1252'. But if the 'cp' prefix
does get added, then the locale module should at least know to remove it when
building a locale name from a tuple.
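For the reverse direction, a sketch of building a locale name from a tuple while dropping a 'cp' prefix could look like this. The helper name build_locale_name is hypothetical, and it assumes the prefix only needs to be removed when it is followed by digits.

```python
import re


def build_locale_name(locale_tuple):
    """Join a (language, encoding) tuple into a name for C setlocale()."""
    language, encoding = locale_tuple
    if encoding is None:
        return language
    # Strip a Python-style "cp" prefix, e.g. "cp1254" -> "1254".
    match = re.fullmatch(r"cp(\d+)", encoding, re.IGNORECASE)
    if match:
        encoding = match.group(1)
    return f"{language}.{encoding}"
```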
[1] https://tools.ietf.org/rfc/bcp/bcp47.txt
----------
_______________________________________
Python tracker
<https://bugs.python.org/issue37945>
_______________________________________