Eryk Sun <eryk...@gmail.com> added the comment:

If normalize() is implemented for Windows, then the tests should be split out 
into POSIX and Windows versions. Currently, most of the tests in NormalizeTest 
are not checking a result that's properly normalized for ucrt.
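
As a rough sketch of what such a split might look like (the class names here 
are made up, and the Windows expectation is only the target value discussed 
in this message, not something locale.normalize returns today), the shared 
checks could live in a mixin, with the platform-specific cases skipped as 
appropriate:

```python
import locale
import sys
import unittest


class NormalizeChecksMixin:
    """Checks shared by both platforms (hypothetical helper class)."""

    def check(self, localename, expected):
        self.assertEqual(locale.normalize(localename), expected)

    def test_c(self):
        # "c" and "posix" should map to "C" everywhere.
        self.check("c", "C")
        self.check("posix", "C")


@unittest.skipIf(sys.platform == "win32", "expects POSIX-style results")
class PosixNormalizeTest(NormalizeChecksMixin, unittest.TestCase):
    def test_english(self):
        self.check("en_US", "en_US.ISO8859-1")


@unittest.skipUnless(sys.platform == "win32", "expects ucrt-style results")
class WindowsNormalizeTest(NormalizeChecksMixin, unittest.TestCase):
    # Assumption: this is the desired future result, so the test is
    # marked as an expected failure until normalize() handles Windows.
    @unittest.expectedFailure
    def test_english(self):
        self.check("en_US.iso8859_1", "English_United States.28591")
```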

A useful implementation of locale.normalize should allow a script to use 
("en_US", "iso8859_1") in Windows without having to know that Latin-1 is 
Windows codepage 28591, or that ucrt requires a classic locale name if the 
encoding isn't UTF-8. The required result for setlocale() is "English_United 
States.28591". 

As far as aliases are concerned, at a minimum, we need to map "posix" and "c" 
to "C". We can also support "C.UTF-8" as "en_US.UTF-8". Do we need to support 
the Unix locale_alias mappings from X.org? If so, I suppose we could use a 
double mapping. First try the Unix locale_alias mapping. Then try that result 
in a windows_locale_alias mapping that includes additional mappings from Unix 
to Windows. For example: 

    sr_CS.UTF-8          -> sr_Cyrl_CS.UTF-8
    sr_CS.UTF-8@latin    -> sr_Latn_CS.UTF-8
    ca_ES.UTF-8@valencia -> ca_ES_valencia.UTF-8

Note that the last one doesn't currently work. "ca-ES-valencia" is a valid 
Windows locale name for the Valencian variant of Catalan (ca), which lacks an 
ISO 639 code of its own since it's officially (and somewhat controversially) 
designated as a dialect of Catalan. This is an unusual case that has a subtag 
after the region, which ucrt's manual BCP-47 parsing cannot handle. (It tries 
to parse "ES" as the script and "valencia" as an ISO 3166-1 country code.)

After mapping aliases, if the result still has "@" in it, normalize() should 
fail. We don't know what the "@" modifier means.
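
To make the double mapping concrete, here is a minimal sketch. The two dicts 
are small hypothetical stand-ins (a real version would presumably start from 
locale.locale_alias), and raising ValueError for a leftover "@" is only one 
possible way for normalize() to "fail":

```python
# Hypothetical, abbreviated stand-in for the Unix locale_alias mapping.
UNIX_LOCALE_ALIAS = {
    "posix": "C",
    "c": "C",
}

# Hypothetical second-stage mapping from Unix names to Windows names,
# using the examples from this message.
WINDOWS_LOCALE_ALIAS = {
    "C.UTF-8": "en_US.UTF-8",
    "sr_CS.UTF-8": "sr_Cyrl_CS.UTF-8",
    "sr_CS.UTF-8@latin": "sr_Latn_CS.UTF-8",
    "ca_ES.UTF-8@valencia": "ca_ES_valencia.UTF-8",
}


def resolve_aliases(localename):
    """Apply the Unix alias mapping, then the Windows alias mapping."""
    name = UNIX_LOCALE_ALIAS.get(localename.lower(), localename)
    name = WINDOWS_LOCALE_ALIAS.get(name, name)
    if "@" in name:
        # The modifier was not consumed by either mapping, so its
        # meaning is unknown.
        raise ValueError(f"unsupported locale modifier in {name!r}")
    return name
```

With these tables, resolve_aliases("sr_CS.UTF-8@latin") gives 
"sr_Latn_CS.UTF-8", while an unmapped name such as "de_DE.UTF-8@euro" raises.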

Otherwise, split the locale name and encoding parts. If the encoding isn't 
UTF-8, try to map it to a codepage. For this we need a windows_codepage_alias 
dict that maps IANA official and Python-specific encoding names to Windows 
codepages. Next, check the locale name via WINAPI IsValidLocaleName. If it's 
not valid, try replacing underscore with hyphen and check again. If it's still 
not valid, assume it's a classic ucrt locale name. (It may not be valid, but 
implementing all of the work ucrt does to parse a classic locale name is too 
much, I think.) If it's a valid Windows locale name, and we have a codepage 
encoding, then try to translate it as a classic ucrt locale name. This 
requires two WINAPI GetLocaleInfoEx calls to look up the English versions of 
the language and country name.
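
A rough sketch of these steps, under several assumptions: the codepage table 
is abbreviated and hypothetical, the function names are invented for 
illustration, and the behaviour for an unmapped encoding (leave it unchanged) 
is a guess. The two ctypes helpers only work on Windows; the validator and 
name lookup are passed in as parameters so the logic can be exercised 
elsewhere with stand-ins.

```python
import ctypes
from ctypes import c_int, c_uint32, c_wchar_p, create_unicode_buffer

# Hypothetical, abbreviated table: lowercased encoding name -> codepage.
WINDOWS_CODEPAGE_ALIAS = {
    "iso-8859-1": 28591,    # IANA name
    "iso8859_1": 28591,     # Python codec name
    "latin_1": 28591,
    "windows-1252": 1252,
    "cp1252": 1252,
}

LOCALE_SENGLISHLANGUAGENAME = 0x1001
LOCALE_SENGLISHCOUNTRYNAME = 0x1002


def winapi_is_valid(name):
    """Call IsValidLocaleName (Windows only)."""
    func = ctypes.WinDLL("kernel32").IsValidLocaleName
    func.argtypes = (c_wchar_p,)
    return bool(func(name))


def winapi_english_names(name):
    """Look up the English language and country names (Windows only)."""
    func = ctypes.WinDLL("kernel32", use_last_error=True).GetLocaleInfoEx
    func.argtypes = (c_wchar_p, c_uint32, c_wchar_p, c_int)
    names = []
    for lctype in (LOCALE_SENGLISHLANGUAGENAME, LOCALE_SENGLISHCOUNTRYNAME):
        buf = create_unicode_buffer(128)
        if not func(name, lctype, buf, len(buf)):
            raise ctypes.WinError(ctypes.get_last_error())
        names.append(buf.value)
    return tuple(names)


def to_ucrt_locale(name, is_valid=winapi_is_valid,
                   english_names=winapi_english_names):
    """Translate an alias-resolved 'locale[.encoding]' name for ucrt."""
    localename, _, encoding = name.partition(".")

    # Map a non-UTF-8 encoding to a codepage, if we know it.
    codepage = None
    if encoding and encoding.lower() not in ("utf-8", "utf8"):
        codepage = WINDOWS_CODEPAGE_ALIAS.get(encoding.lower())

    # Try the name as given, then with underscores replaced by hyphens.
    windows_name = None
    for candidate in (localename, localename.replace("_", "-")):
        if is_valid(candidate):
            windows_name = candidate
            break

    # Valid Windows name plus a codepage: build the classic ucrt name.
    if windows_name is not None and codepage is not None:
        language, country = english_names(windows_name)
        return f"{language}_{country}.{codepage}"

    # Otherwise assume a classic name (or a UTF-8 locale) and keep it,
    # substituting the codepage number when one was found.
    if codepage is not None:
        encoding = str(codepage)
    return f"{localename}.{encoding}" if encoding else localename
```

On Windows, to_ucrt_locale("en_US.iso8859_1") is intended to produce the 
"English_United States.28591" form mentioned earlier; a name that fails both 
validity checks is passed through with only its encoding translated.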

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue37945>
_______________________________________