[Python-ideas] Re: Recommend UTF-8 mode on Windows

Eryk Sun Sun, 12 Jan 2020 04:33:46 -0800

On 1/10/20, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
> On Jan 10, 2020, at 03:45, Inada Naoki <songofaca...@gmail.com> wrote:
>
> Also, PYTHONUTF8 is only supported on Unix, so presumably it’s ignored if
> you set it on Windows, right?

The implementation of UTF-8 mode (i.e. -Xutf8) is cross-platform,
though I think it could use some tweaking for Windows.

>> I believe UTF-8 should be chosen by default for text encoding.
>
> Correct me if I’m wrong, but I think in Python 3.7 on Windows 10, the
> filesystem encoding is already UTF-8, and the stdio console files are UTF-8
> (but under the covers actually wrap the native UTF-16 console APIs instead
> of using msvcrt stdio), so the only issue is the locale encoding, right?

Yes, Python 3.6+ on Windows defaults to UTF-8 for console I/O and the
filesystem encoding. If for some reason you need the legacy behavior,
it can be enabled via the following environment variables [1]:
PYTHONLEGACYWINDOWSSTDIO and PYTHONLEGACYWINDOWSFSENCODING.

Setting PYTHONLEGACYWINDOWSFSENCODING switches the filesystem encoding
to "mbcs". Note that this does not use the system MBS (multibyte
string) API. Python still calls the wide-character API and simply
transcodes between UTF-16 and the ANSI codepage instead of between
UTF-16 and UTF-8. Currently this setting takes precedence over UTF-8
mode, but I think it should be the other way around.

Setting PYTHONLEGACYWINDOWSSTDIO uses the console input codepage for
stdin and the console output codepage for stdout and stderr, but only
if isatty is true and the process is attached to a console (see
_Py_device_encoding in Python/fileutils.c). Otherwise it uses the
system ANSI codepage.
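
To see which of these settings are in effect for a given interpreter,
something like the following can be run. It is only a minimal sketch:
describe_text_encodings is a hypothetical helper name, it assumes
Python 3.7+ (for sys.flags.utf8_mode), and it only inspects the
environment variables rather than the interpreter's internal config.

```python
import os
import sys


def describe_text_encodings():
    """Collect the encoding-related settings of the running interpreter."""
    report = {
        "utf8_mode": bool(sys.flags.utf8_mode),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "filesystem_errors": sys.getfilesystemencodeerrors(),
        # The legacy switches count as enabled when set to a non-empty string.
        "legacy_fs_env": bool(os.environ.get("PYTHONLEGACYWINDOWSFSENCODING")),
        "legacy_stdio_env": bool(os.environ.get("PYTHONLEGACYWINDOWSSTDIO")),
    }
    for name in ("stdin", "stdout", "stderr"):
        stream = getattr(sys, name, None)
        report[name + "_encoding"] = getattr(stream, "encoding", None)
        try:
            # os.device_encoding returns None when the descriptor is not
            # connected to a terminal or console.
            report[name + "_device_encoding"] = os.device_encoding(stream.fileno())
        except (AttributeError, OSError, ValueError):
            # The stream is missing, replaced, or has no file descriptor.
            report[name + "_device_encoding"] = None
    return report


if __name__ == "__main__":
    for key, value in sorted(describe_text_encodings().items()):
        print(key, "=", value)
```

Comparing the output with and without -Xutf8 and the two legacy
variables shows how the settings interact on a particular build.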

Note that this setting is currently **broken** in 3.8. In
Python/initconfig.c, config_init_stdio_encoding calls
config_get_locale_encoding to set config->stdio_encoding. The result is
always the system ANSI codepage (e.g. 1252), even for console files,
for which this choice makes no sense.

Combining UTF-8 mode with legacy Windows standard I/O is generally
dysfunctional. The result is mojibake, unless the console codepage
happens to be UTF-8. I'd prefer UTF-8 mode to take precedence over
legacy standard I/O mode and have it imply non-legacy I/O.
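
The mojibake is easy to reproduce without a console. The sketch below
assumes the failure mode is UTF-8 bytes being interpreted in a legacy
codepage; misdecode is a hypothetical helper, and cp1252 is chosen
only for illustration.

```python
def misdecode(text, wrong_codepage):
    """Encode text as UTF-8, then decode the bytes in another codepage."""
    utf8_bytes = text.encode("utf-8")
    return utf8_bytes.decode(wrong_codepage, errors="replace")


# U+0100 is b'\xc4\x80' in UTF-8; cp1252 reads those two bytes as two
# separate characters, U+00C4 and U+20AC.
garbled = misdecode("SPĀM", "cp1252")
assert garbled == "SP\u00c4\u20acM"

# Pure ASCII text survives, which is why the problem often goes unnoticed.
assert misdecode("SPAM", "cp1252") == "SPAM"
```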

In both of the above cases, I'd prefer UTF-8 mode to take precedence
over the legacy modes, i.e. to disable
config->legacy_windows_fs_encoding and config->legacy_windows_stdio in
the startup configuration.
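
As a toy model of that proposal (not the actual C code in
Python/initconfig.c): StartupConfig and apply_proposed_precedence are
hypothetical names, and the fields merely mirror the config members
mentioned above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StartupConfig:
    """Hypothetical stand-in for the relevant startup config fields."""
    utf8_mode: bool = False
    legacy_windows_fs_encoding: bool = False
    legacy_windows_stdio: bool = False


def apply_proposed_precedence(config):
    """Return a config in which UTF-8 mode disables both legacy modes."""
    if config.utf8_mode:
        return replace(
            config,
            legacy_windows_fs_encoding=False,
            legacy_windows_stdio=False,
        )
    # Without UTF-8 mode the legacy switches are left untouched.
    return config


requested = StartupConfig(
    utf8_mode=True,
    legacy_windows_fs_encoding=True,
    legacy_windows_stdio=True,
)
effective = apply_proposed_precedence(requested)
assert not effective.legacy_windows_fs_encoding
assert not effective.legacy_windows_stdio
```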

Regarding the MBS API and UTF-8

In Windows 10, it's possible to set the ANSI and OEM codepages to
UTF-8 at both the system level (in the system control panel) and the
application level (in the application manifest). But many functions
are still only available in the WCS (wide-character string) API, such
as GetLocaleInfoEx, GetFileInformationByHandleEx, and
SetFileInformationByHandle. I don't know whether Microsoft plans to
implement MBS wrappers in these cases.

If the ANSI codepage is UTF-8, then the MBS file API (e.g.
CreateFileA) is basically equivalent to Python's UTF-8 filesystem
encoding. There's one exception. Python uses the "surrogatepass" error
handler, which allows invalid surrogate codes (i.e. a "Wobbly" WTF-8
encoding). In contrast, the MBS API translates invalid surrogates to
the replacement character (U+FFFD). I think Python's choice is more
sensible because the WCS file API (e.g. CreateFileW) and filesystem
drivers do not verify that strings are valid Unicode.
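
The difference can be shown in pure Python. In this sketch,
encode_like_python uses the real "surrogatepass" error handler, while
encode_like_mbs_api is only a simulation of the replacement behavior
described above, not a call into the Windows API. Both function names
are hypothetical.

```python
import re

# In a Python str, any code point in this range is a lone surrogate.
LONE_SURROGATE = re.compile("[\ud800-\udfff]")


def encode_like_python(name):
    """Encode with 'surrogatepass', keeping lone surrogates (WTF-8 style)."""
    return name.encode("utf-8", "surrogatepass")


def encode_like_mbs_api(name):
    """Simulate replacing lone surrogates with U+FFFD before encoding."""
    return LONE_SURROGATE.sub("\ufffd", name).encode("utf-8")


name = "file\ud800.txt"  # a name that is not valid Unicode

wobbly = encode_like_python(name)
assert wobbly == b"file\xed\xa0\x80.txt"
# The original name can be recovered exactly.
assert wobbly.decode("utf-8", "surrogatepass") == name

lossy = encode_like_mbs_api(name)
assert lossy == b"file\xef\xbf\xbd.txt"
# The replacement is one-way: the original surrogate is gone.
assert lossy.decode("utf-8") != name
```

With replacement, two different file names that differ only in their
invalid surrogates collapse to the same byte string, so one of them
can no longer be addressed.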

The console uses the system OEM codepage as its default I/O codepage.
Setting OEM to UTF-8 (at the system level, not at the application
level), or manually setting the codepage to UTF-8 via `chcp.com
65001`, is a potential problem because the console doesn't support
reading non-ASCII UTF-8 strings via ReadFile or ReadConsoleA. Prior to
Windows 10, it returns an empty string for this case, which looks like
EOF. The new console in Windows 10 instead replaces each non-ASCII
character with a null byte (e.g. "SPĀM" -> "SP\x00M"), which is better
but still pretty much useless for reading non-English input. Python
3.6+ is for the most part immune to this. In the default
configuration, it uses ReadConsoleW to read UTF-16 instead of relying
on the input codepage. (Low-level os.read is not immune to the
problem, however, because it is not integrated with the new console
I/O implementation.)
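
To make the two failure modes concrete, here is a small simulation of
what a byte-oriented read would return under codepage 65001, based
only on the description above. simulate_console_read is a hypothetical
function; it does not talk to a real console.

```python
def simulate_console_read(text, new_console=True):
    """Mimic the bytes a codepage-65001 console read is described to return."""
    if all(ord(ch) < 128 for ch in text):
        # ASCII-only input is read correctly in either case.
        return text.encode("ascii")
    if not new_console:
        # Before Windows 10: an empty result, indistinguishable from EOF.
        return b""
    # Windows 10 console: each non-ASCII character becomes a null byte.
    return "".join(ch if ord(ch) < 128 else "\x00" for ch in text).encode("ascii")


assert simulate_console_read("SPĀM") == b"SP\x00M"
assert simulate_console_read("SPĀM", new_console=False) == b""
assert simulate_console_read("SPAM", new_console=False) == b"SPAM"
```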

[1] https://docs.python.org/3/using/cmdline.html#environment-variables
_______________________________________________
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/G2NOSM6EFOOO5WCLTCEWJ7DWS57DDZTY/