Eryk Sun <[email protected]> added the comment:
We get into trouble with test_getsetlocale_issue1813 because normalize() maps
"tr_TR" (supported) to "tr_TR.ISO8859-9" (not supported).
>>> locale.normalize('tr_TR')
'tr_TR.ISO8859-9'
We should skip normalize() in Windows. It's based on a POSIX locale_alias
mapping that can only cause problems. The work for normalizing locale names in
Windows is best handled inline in _build_localename and _parse_localename.
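As a rough illustration of the "skip normalize() on Windows" idea, something like the following could gate the call. The helper name normalize_unless_windows and its platform parameter are hypothetical, added only so the behaviour can be exercised on any OS; this is not the actual patch.

```python
import locale
import sys


def normalize_unless_windows(localename, platform=sys.platform):
    """Hypothetical helper: bypass the POSIX alias table on Windows.

    On Windows the name is returned unchanged, so "tr_TR" stays "tr_TR"
    instead of being mapped to an alias entry the CRT may not support.
    """
    if platform == "win32":
        return localename
    # Elsewhere, defer to the existing locale_alias-based normalization.
    return locale.normalize(localename)
```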
For the old long form, C setlocale always returns the codepage encoding (e.g.
"Turkish_Turkey.1254") or "utf8", so that's simple to parse. For BCP 47
locales, the encoding is either "utf8" or "utf-8", or nothing at all. For the
latter, there's an implied legacy ANSI encoding. This is used by the CRT
wherever we depend on byte strings, such as in time.strftime:
mojibake:
>>> locale.setlocale(locale.LC_CTYPE, 'en_GB')
'en_GB'
>>> time.strftime("\u0100")
'A'
correct:
>>> locale.setlocale(locale.LC_CTYPE, 'en_GB.utf-8')
'en_GB.utf-8'
>>> time.strftime("\u0100")
'Ā'
(We should switch back to using wcsftime if possible.)
The implicit BCP-47 case can be parsed as `None` -- e.g. ("tr_TR", None).
However, it might be useful to support getting the ANSI codepage via
GetLocaleInfoEx [1]. A high-level function in locale could internally call
_locale.getlocaleinfo(locale_name, LOCALE_IDEFAULTANSICODEPAGE). This would
return a string such as "1254", or "0" for a Unicode-only language.
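A minimal sketch of the parsing rules described above, assuming the input is a string as returned by C setlocale on Windows. The function name parse_windows_localename is hypothetical, and reporting codepages with a "cp" prefix and UTF-8 as "utf-8" is an assumption of this sketch, not something decided in this issue.

```python
def parse_windows_localename(name):
    """Split a Windows locale name into (language, encoding).

    Hypothetical sketch, not the real locale._parse_localename.
    """
    language, dot, encoding = name.partition(".")
    language = language or None
    if not dot:
        # BCP 47 name without an encoding: the legacy ANSI codepage is
        # implied, which this sketch reports as None.
        return language, None
    if encoding.lower() in ("utf8", "utf-8"):
        # Assumption: fold both spellings to one canonical form.
        return language, "utf-8"
    if encoding.isascii() and encoding.isdigit():
        # Old long form, e.g. "Turkish_Turkey.1254".
        # Assumption: report the codepage as a Python codec name.
        return language, "cp" + encoding
    # Anything else is passed through untouched.
    return language, encoding
```

With these assumptions, "Turkish_Turkey.1254" parses as ("Turkish_Turkey", "cp1254") and "tr_TR" parses as ("tr_TR", None).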
For _build_localename, we can't simply limit the encoding to UTF-8. We need to
support the old long/abbreviated forms (e.g. "trk_TUR", "turkish_Turkey") in
addition to the newer BCP 47 locale names. In the old form we have to support
the following encodings:
* codepage encodings, with an optional "cp" prefix that has
to be stripped, e.g. ("trk_TUR", "cp1254") -> "trk_TUR.1254"
* "ACP" in upper case only -- for the ANSI codepage of the
language
* "utf8" and "utf-8", each accepted in any letter case
(The CRT documentation says "OEM" should also be supported, but it's not.)
A locale name can also omit the language in the old form -- e.g. (None, "ACP")
or (None, "cp1254"). The CRT uses the current language in this case. This is
discouraged because the result may be nonsense.
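The encoding rules for _build_localename listed above could be sketched roughly as follows. The function name build_windows_localename is hypothetical, and raising ValueError for an unrecognized encoding is an assumption of this sketch rather than agreed behaviour.

```python
def build_windows_localename(language, encoding):
    """Join (language, encoding) into a name for C setlocale on Windows.

    Hypothetical sketch, not the real locale._build_localename.
    """
    if encoding is None:
        if language is None:
            raise ValueError("language and encoding cannot both be None")
        # BCP 47 name with the implied ANSI codepage, e.g. "tr_TR".
        return language
    lowered = encoding.lower()
    if lowered in ("utf8", "utf-8"):
        # Both spellings are accepted in any letter case; pass through.
        suffix = encoding
    elif encoding == "ACP":
        # Upper case only: "acp" falls through and is rejected below.
        suffix = encoding
    else:
        # Codepage number with an optional "cp" prefix to strip.
        digits = lowered[2:] if lowered.startswith("cp") else lowered
        if not (digits.isascii() and digits.isdigit()):
            raise ValueError("unsupported encoding: %r" % (encoding,))
        suffix = digits
    # An omitted language yields the ".codepage" form, e.g. ".ACP".
    return "%s.%s" % (language or "", suffix)
```

Under these assumptions, ("trk_TUR", "cp1254") gives "trk_TUR.1254" and (None, "ACP") gives ".ACP".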
[1]
https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getlocaleinfoex
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue37945>
_______________________________________