Eryk Sun <[email protected]> added the comment:
> On Windows 10 (version 1903), ANSI code page 1252, OEM code page 437,
> LC_CTYPE locale "French_France.1252"
The CRT default locale (i.e. the empty locale "") uses the user locale, which
is the "Format" value on the Region->Formats tab. It does not use the system
locale from the Region->Administrative tab.
The default locale normally uses the user locale's ANSI codepage, as returned
by GetLocaleInfoEx(LOCALE_NAME_USER_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE, ...).
But if the active codepage of the process is UTF-8, then GetACP(), GetOEMCP(),
and setlocale(LC_CTYPE, "") all use UTF-8 (i.e. CP_UTF8, i.e. 65001). The
active codepage can be set to UTF-8 either at the system-locale level or in the
application-manifest. For example, with the active codepage setting in the
manifest:
C:\>python.utf8.exe -q
>>> from locale import setlocale, LC_CTYPE
>>> setlocale(LC_CTYPE, "")
'English_Canada.utf8'
>>> import ctypes
>>> kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
>>> kernel32.GetACP()
65001
>>> kernel32.GetOEMCP()
65001
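The check above can be wrapped up for reuse. This is only a sketch: the
function name active_codepages is made up for illustration, and it returns
None anywhere other than Windows, since GetACP and GetOEMCP exist only in the
Windows API.

```python
import ctypes
import sys

CP_UTF8 = 65001


def active_codepages():
    """Return (ANSI, OEM) active code pages, or None when not on Windows."""
    if sys.platform != "win32":
        return None  # GetACP/GetOEMCP are Windows-only
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    return kernel32.GetACP(), kernel32.GetOEMCP()


pages = active_codepages()
if pages is not None:
    # Both values are 65001 when the process active codepage is UTF-8.
    print("process codepage is UTF-8:", pages == (CP_UTF8, CP_UTF8))
```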
A default locale name can also specify the codepage to use. It could be UTF-8,
a particular codepage, ".ACP" (ANSI), or ".OCP" (OEM). "ACP" and "OCP" have to
be in upper case. For example:
>>> setlocale(LC_CTYPE, '.utf8')
'English_Canada.utf8'
>>> setlocale(LC_CTYPE, '.437')
'English_Canada.437'
>>> setlocale(LC_CTYPE, ".ACP")
'English_Canada.1252'
>>> setlocale(LC_CTYPE, ".OCP")
'English_Canada.850'
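To make the suffix rules concrete, here is a small pure-Python sketch that
classifies the part after the last dot. The function name and the category
labels are hypothetical, and it does not call the CRT at all. Treating the
UTF-8 suffix as case insensitive in this form is an assumption on my part;
the upper-case requirement for "ACP" and "OCP" is as described above.

```python
def classify_codepage_suffix(name):
    """Classify the codepage suffix of a Windows CRT locale name.

    Returns one of the hypothetical labels: "none", "utf8", "ansi",
    "oem", "codepage", or "unknown".
    """
    _, dot, suffix = name.rpartition(".")
    if not dot:
        return "none"  # no suffix at all, e.g. "English_Canada"
    if suffix.lower() in ("utf8", "utf-8"):
        return "utf8"  # assumed case insensitive
    if suffix == "ACP":
        return "ansi"  # must be upper case
    if suffix == "OCP":
        return "oem"  # must be upper case
    if suffix.isascii() and suffix.isdigit():
        return "codepage"  # e.g. ".437"
    return "unknown"


for example in (".utf8", ".437", ".ACP", ".OCP", ".acp", "English_Canada"):
    print(example, classify_codepage_suffix(example))
```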
Otherwise, if you provide a known locale (a full name, a three-letter
abbreviation, or one of the small set of locale aliases), then setlocale
queries any missing values from the NLS database.
One snag in the road is the set of Unicode-only locales, such as "Hindi_India".
Querying the ANSI and OEM codepages for a Unicode-only locale respectively
returns CP_ACP (0) and CP_OEMCP (1). It used to be that the CRT would end up
using the system locale for these cases. But recently ucrt has switched to
using UTF-8 for these cases. For example:
>>> setlocale(LC_CTYPE, "Hindi_India")
'Hindi_India.utf8'
That brings us to the case of modern Windows BCP-47 locale names, which usually
lack an implicit encoding. For example:
>>> setlocale(LC_CTYPE, "hi_IN")
'hi_IN'
The current CRT codepage can be queried via ___lc_codepage_func:
>>> import ctypes; ucrt = ctypes.CDLL('ucrtbase', use_errno=True)
>>> ucrt.___lc_codepage_func()
65001
With the exception of Unicode-only locales, using a modern name without an
encoding defaults to the named locale's ANSI codepage. For example:
>>> setlocale(LC_CTYPE, "en_CA")
'en_CA'
>>> ucrt.___lc_codepage_func()
1252
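For scripting, the query can be wrapped as below. This is a sketch only: the
wrapper name current_crt_codepage is hypothetical, and it returns None off
Windows because ucrtbase is not available there. Note that the exported
function name has three leading underscores.

```python
import ctypes
import sys


def current_crt_codepage():
    """Return the CRT's current LC_CTYPE codepage, or None off Windows."""
    if sys.platform != "win32":
        return None  # ucrtbase.dll only exists on Windows
    ucrt = ctypes.CDLL("ucrtbase", use_errno=True)
    func = ucrt.___lc_codepage_func
    func.restype = ctypes.c_uint
    func.argtypes = []
    return func()


print(current_crt_codepage())
```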
The only encoding allowed in BCP-47 locale names is ".utf8" or ".utf-8" (case
insensitive):
>>> setlocale(LC_CTYPE, "fr_FR.utf8")
'fr_FR.utf8'
>>> setlocale(LC_CTYPE, "fr_FR.UTF-8")
'fr_FR.UTF-8'
No other encoding is allowed with this form. For example:
>>> try: setlocale(LC_CTYPE, "fr_FR.ACP")
... except Exception as e: print(e)
...
unsupported locale setting
>>> try: setlocale(LC_CTYPE, "fr_FR.1252")
... except Exception as e: print(e)
...
unsupported locale setting
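When probing many names like this, a helper that swallows the error and
restores the previous locale keeps the session tidy. A minimal sketch; the
name try_setlocale is made up here, and which names succeed depends entirely
on the platform's C runtime.

```python
import locale


def try_setlocale(category, name):
    """Try a locale name; return setlocale's result, or None if rejected.

    The previous setting for the category is restored afterwards.
    """
    previous = locale.setlocale(category)  # query without changing
    try:
        return locale.setlocale(category, name)
    except locale.Error:
        return None
    finally:
        locale.setlocale(category, previous)


for candidate in ("fr_FR.utf8", "fr_FR.ACP", "fr_FR.1252"):
    print(candidate, "->", try_setlocale(locale.LC_CTYPE, candidate))
```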
As to the "tr_TR" locale bug, the Windows implementation is broken due to
assumptions that POSIX locale names are directly supported. A significant
redesign is required to connect the dots.
>>> from locale import getlocale
>>> setlocale(LC_CTYPE, 'tr_TR')
'tr_TR'
>>> ucrt.___lc_codepage_func()
1254
>>> getlocale(LC_CTYPE)
('tr_TR', 'ISO8859-9')
Codepage 1254 is similar to ISO8859-9, except, in typical fashion, Microsoft
assigned most of the upper control range 0x80-0x9F to an assortment of
characters it deemed useful, such as the Euro symbol "€". getlocale() should
instead query the exact codepage via ___lc_codepage_func() and return
('tr_TR', 'cp1254').
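The difference between the two encodings is easy to see with Python's own
codecs. The mapping helper below is a hypothetical illustration of turning a
codepage number into a codec name; it is not what getlocale() currently does.

```python
CP_UTF8 = 65001


def codepage_to_codec(codepage):
    """Map a Windows codepage number to a Python codec name."""
    return "utf-8" if codepage == CP_UTF8 else f"cp{codepage}"


codec = codepage_to_codec(1254)
# Byte 0x80 is the Euro sign in cp1254 but a C1 control in ISO 8859-9.
# ascii() keeps the output printable on any console encoding.
print(codec, ascii(b"\x80".decode(codec)))
print("iso8859-9", ascii(b"\x80".decode("iso8859-9")))
```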
Conversely, setlocale() needs to know that this BCP-47 name does not support an
explicit encoding, unless it's "utf8". If the given codepage, or an associated
alias, doesn't match the locale's ANSI codepage, then the locale name has to be
expanded to the full name "Turkish_Turkey". The long name allows specifying an
arbitrary codepage.
For example, say we have ('tr_TR', 'ISO8859-7'), i.e. Greek with Turkish locale
rules. This transforms to the closest approximation ('tr_TR', '1253'). When
setlocale queries the OS, it will find that the ANSI codepage is actually 1254,
so it cannot use "tr_TR" or "tr-TR". It needs to expand to the long form:
>>> setlocale(LC_CTYPE, 'Turkish_Turkey.1253')
'Turkish_Turkey.1253'
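The selection rule described above could be sketched roughly as follows. All
names here are hypothetical, and the ANSI codepage and long name are passed in
as plain arguments; a real implementation would have to query them from the
OS.

```python
CP_UTF8 = 65001


def crt_locale_name(short_name, long_name, ansi_codepage, codepage):
    """Pick a locale string that setlocale() should accept on Windows.

    short_name:    modern name such as "tr_TR"
    long_name:     full name such as "Turkish_Turkey"
    ansi_codepage: the locale's ANSI codepage, e.g. 1254
    codepage:      the codepage the caller actually wants
    """
    if codepage == CP_UTF8:
        return f"{short_name}.utf8"  # the only encoding the short form allows
    if codepage == ansi_codepage:
        return short_name  # the short form implies the ANSI codepage
    return f"{long_name}.{codepage}"  # the long form takes any codepage


print(crt_locale_name("tr_TR", "Turkish_Turkey", 1254, 1253))
```

With the values from the example, this yields the long form because 1253 does
not match the ANSI codepage 1254.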
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue38324>
_______________________________________