Eryk Sun <eryk...@gmail.com> added the comment:
If normalize() is implemented for Windows, then the tests should be split out into POSIX and Windows versions. Currently, most of the tests in NormalizeTest are not checking a result that's properly normalized for ucrt.

A useful implementation of locale.normalize should allow a script to use ("en_US", "iso8859_1") in Windows without having to know that Latin-1 is Windows codepage 28591, or that ucrt requires a classic locale name if the encoding isn't UTF-8. The required result for setlocale() is "English_United States.28591".

As far as aliases are concerned, at a minimum, we need to map "posix" and "c" to "C". We can also support "C.UTF-8" as "en_US.UTF-8".

Do we need to support the Unix locale_alias mappings from X.org? If so, I suppose we could use a double mapping. First try the Unix locale_alias mapping. Then try that result in a windows_locale_alias mapping that includes additional mappings from Unix to Windows. For example:

    sr_CS.UTF-8          -> sr_Cyrl_CS.UTF-8
    sr_CS.UTF-8@latin    -> sr_Latn_CS.UTF-8
    ca_ES.UTF-8@valencia -> ca_ES_valencia.UTF-8

Note that the last one doesn't currently work. "ca-ES-valencia" is a valid Windows locale name for the Valencian variant of Catalan (ca), which lacks an ISO 639 code of its own since it's officially (and somewhat controversially) designated as a dialect of Catalan. This is an unusual case that has a subtag after the region, which ucrt's manual BCP-47 parsing cannot handle. (It tries to parse "ES" as the script and "valencia" as an ISO 3166-1 country code.)

After mapping aliases, if the result still has "@" in it, normalize() should fail. We don't know what the "@" modifier means. Otherwise, split the locale name and encoding parts. If the encoding isn't UTF-8, try to map it to a codepage. For this we need a windows_codepage_alias dict that maps IANA official and Python-specific encoding names to Windows codepages.

Next, check the locale name via WINAPI IsValidLocaleName.
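
The alias and encoding steps up to this point could be sketched roughly as follows. This is only an illustration, not a proposed patch: the function name split_windows_locale is hypothetical, both tables contain just the entries mentioned in this comment, the X.org locale_alias stage is left out, and codecs.lookup() stands in for the handling of Python-specific encoding names.

```python
import codecs

# Hypothetical table: only the mappings mentioned in this comment.
# Keys are lowercased so the lookup is case-insensitive.
WINDOWS_LOCALE_ALIAS = {
    "posix": "C",
    "c": "C",
    "c.utf-8": "en_US.UTF-8",
    "sr_cs.utf-8": "sr_Cyrl_CS.UTF-8",
    "sr_cs.utf-8@latin": "sr_Latn_CS.UTF-8",
    "ca_es.utf-8@valencia": "ca_ES_valencia.UTF-8",
}

# Hypothetical table keyed by Python's canonical codec name.
# A real table would need many more entries.
WINDOWS_CODEPAGE_ALIAS = {
    "iso8859-1": 28591,
}


def split_windows_locale(localename):
    """Return (name, encoding); encoding is "UTF-8", a codepage string, or None."""
    name = WINDOWS_LOCALE_ALIAS.get(localename.lower(), localename)
    if "@" in name:
        # The modifier was not consumed by an alias, so its meaning is unknown.
        raise ValueError(f"unsupported locale modifier: {localename!r}")
    name, dot, encoding = name.partition(".")
    if not dot:
        return name, None
    try:
        codec = codecs.lookup(encoding).name
    except LookupError:
        raise ValueError(f"unknown encoding: {encoding!r}") from None
    if codec == "utf-8":
        return name, "UTF-8"
    try:
        return name, str(WINDOWS_CODEPAGE_ALIAS[codec])
    except KeyError:
        raise ValueError(f"no known codepage for: {encoding!r}") from None
```

With these tables, split_windows_locale("en_US.iso8859_1") gives ("en_US", "28591"). Turning "en_US" into the classic "English_United States" form is a separate step that this sketch does not attempt.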
If the locale name isn't valid according to IsValidLocaleName, try replacing underscores with hyphens and check again. If it's still not valid, assume it's a classic ucrt locale name. (It may not be valid, but implementing all of the work ucrt does to parse a classic locale name is too much, I think.)

If it's a valid Windows locale name, and we have a codepage encoding, then try to translate it as a classic ucrt locale name. This requires two WINAPI GetLocaleInfoEx calls to look up the English versions of the language and country name.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue37945>
_______________________________________