On 1/10/20, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
> On Jan 10, 2020, at 03:45, Inada Naoki <songofaca...@gmail.com> wrote:
>
> Also, PYTHONUTF8 is only supported on Unix, so presumably it’s ignored if
> you set it on Windows, right?

The implementation of UTF-8 mode (i.e. -Xutf8) is cross-platform,
though I think it could use some tweaking for Windows.

>> I believe UTF-8 should be chosen by default for text encoding.
>
> Correct me if I’m wrong, but I think in Python 3.7 on Windows 10, the
> filesystem encoding is already UTF-8, and the stdio console files are UTF-8
> (but under the covers actually wrap the native UTF-16 console APIs instead
> of using msvcrt stdio), so the only issue is the locale encoding, right?

Yes, Python 3.6+ on Windows defaults to UTF-8 for console I/O and the
filesystem encoding. If for some reason you need the legacy behavior,
it can be enabled via the following environment variables [1]:
PYTHONLEGACYWINDOWSSTDIO and PYTHONLEGACYWINDOWSFSENCODING.

Setting PYTHONLEGACYWINDOWSFSENCODING switches the filesystem encoding
to "mbcs". Note that this does not use the system MBS (multibyte
string) API. Python simply transcodes between UTF-16 and ANSI instead
of UTF-8. Currently this setting takes precedence over UTF-8 mode, but
I think it should be the other way around.

Setting PYTHONLEGACYWINDOWSSTDIO makes Python use the console input
codepage for stdin and the console output codepage for stdout and
stderr, but only if isatty is true and the process is attached to a
console (see _Py_device_encoding in Python/fileutils.c). Otherwise it
uses the system ANSI codepage.

Note that this setting is currently **broken** in 3.8. In
Python/initconfig.c, config_init_stdio_encoding calls
config_get_locale_encoding to set config->stdio_encoding. This always
uses the system ANSI codepage (e.g. 1252), even for console files, for
which this choice makes no sense.

Combining UTF-8 mode with legacy Windows standard I/O is generally
dysfunctional. The result is mojibake, unless the console codepage
happens to be UTF-8. I'd prefer UTF-8 mode to take precedence over
legacy standard I/O mode and have it imply non-legacy I/O.

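To check which of these settings actually took effect in a given
session, you can ask the interpreter what it settled on at startup.
This is only a rough sketch: the helper name report_encodings is made
up for illustration, and the values you get depend on the platform,
the Python version, and the environment variables discussed above.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings the running interpreter has settled on."""
    return {
        # Nonzero when UTF-8 mode (-X utf8 or PYTHONUTF8=1) is enabled.
        "utf8_mode": sys.flags.utf8_mode,
        # Encoding and error handler used for file names.
        "filesystem": sys.getfilesystemencoding(),
        "fs_errors": sys.getfilesystemencodeerrors(),
        # Locale encoding, without calling setlocale first.
        "locale": locale.getpreferredencoding(False),
        # The standard streams can be None (e.g. under pythonw.exe),
        # hence getattr with a default.
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
        "stderr": getattr(sys.stderr, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in report_encodings().items():
        print(f"{name:>10}: {value}")
```

Running it once normally, once with -X utf8, and once with
PYTHONLEGACYWINDOWSSTDIO set makes it easy to compare the combinations
described above.
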
In both of the above cases, what I'd prefer is for UTF-8 mode to take
precedence over legacy modes, i.e. to disable
config->legacy_windows_fs_encoding and config->legacy_windows_stdio in
the startup configuration.

Regarding the MBS API and UTF-8

In Windows 10, it's possible to set the ANSI and OEM codepages to
UTF-8 at both the system level (in the system control panel) and the
application level (in the application manifest). But many functions
are still only available in the WCS (wide-character string) API, such
as GetLocaleInfoEx, GetFileInformationByHandleEx, and
SetFileInformationByHandle. I don't know whether Microsoft plans to
implement MBS wrappers in these cases.

If the ANSI codepage is UTF-8, then the MBS file API (e.g.
CreateFileA) is basically equivalent to Python's UTF-8 filesystem
encoding. There's one exception. Python uses the "surrogatepass" error
handler, which allows invalid surrogate codes (i.e. a "Wobbly" WTF-8
encoding). In contrast, the MBS API translates invalid surrogates to
the replacement character (U+FFFD). I think Python's choice is more
sensible because the WCS file API (e.g. CreateFileW) and filesystem
drivers do not verify that strings are valid Unicode.

The console uses the system OEM codepage as its default I/O codepage.
Setting OEM to UTF-8 (at the system level, not at the application
level), or manually setting the codepage to UTF-8 via `chcp.com
65001`, is a potential problem because the console doesn't support
reading non-ASCII UTF-8 strings via ReadFile or ReadConsoleA. Prior to
Windows 10, it returns an empty string for this case, which looks like
EOF. The new console in Windows 10 instead translates each non-ASCII
character to a null byte (e.g. "SPĀM" -> "SP\x00M"), which is better
but still pretty much useless for reading non-English input.

Python 3.6+ is for the most part immune to this. In the default
configuration, it uses ReadConsoleW to read UTF-16 instead of relying
on the input codepage.

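If you want to see which codepages are in play on a given machine, a
small ctypes sketch along these lines should do. The helper name
query_codepages is hypothetical; the four kernel32 functions it calls
(GetACP, GetOEMCP, GetConsoleCP, GetConsoleOutputCP) are the standard
Windows APIs. On other platforms it just returns None.

```python
import sys


def query_codepages():
    """Return the ANSI, OEM, and console codepages, or None off Windows."""
    if sys.platform != "win32":
        return None

    import ctypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    return {
        # System ANSI and OEM codepages (65001 means UTF-8).
        "ansi": kernel32.GetACP(),
        "oem": kernel32.GetOEMCP(),
        # Current console codepages, i.e. what chcp.com changes. These
        # are expected to be 0 when no console is attached.
        "console_input": kernel32.GetConsoleCP(),
        "console_output": kernel32.GetConsoleOutputCP(),
    }


if __name__ == "__main__":
    print(query_codepages())
```

A console_input value of 65001 is the case where ReadFile and
ReadConsoleA run into the non-ASCII input problem described above.
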
(Low-level os.read is not immune to the problem, however, because it
is not integrated with the new console I/O implementation.)

[1] https://docs.python.org/3/using/cmdline.html#environment-variables