Eryk Sun <eryk...@gmail.com> added the comment:

> I understand Python should be reading the current CP (from 
> GetConsoleOutputCP) or using the default OEM CP, and not assuming ANSI CP
> for stdio

A while ago I analyzed text encodings used by many of the legacy CLI programs 
in Windows. Some programs hard code using either the ANSI or OEM code page, and 
others use either the console's current input code page or its current output 
code page. In light of the inconsistencies, I think defaulting to ANSI for 
non-console standard I/O is fine.

> There's an IO codepage set on Windows consoles (`chcp` for cmd, 
> `[Console]::InputEncoding; [Console]::OutputEncoding` for PowerShell ;

The CMD shell is a Unicode (UTF-16) application, i.e. it calls wide-character 
system and console I/O functions such as ReadConsoleW() and WriteConsoleW(). It 
still uses the console output code page, but as a kind of locale encoding. For 
example, CMD uses the *output* code page when reading a batch file as well as 
when reading output from an external command in a `FOR /F` loop. If Python were 
only concerned with satisfying a `FOR /F` loop in CMD, then it would be 
reasonable to make stdout default to the console output code page. But 
"more.com" and "find.exe" are commonly used as well, and they decode piped 
input using the console *input* code page. Other commands such as "findstr.exe" 
use OEM.

PowerShell adds a spin to this problem. In CMD, piping bytes between two 
processes doesn't actively involve the shell. It just creates an anonymous 
pipe, with each process connected to either end. In contrast, PowerShell 
injects itself as a middle man. For example, piping between "python.exe" and 
"more.com" is implemented as a pipe from "python.exe" to PowerShell and a 
separate pipe from PowerShell to "more.com". In between, PowerShell decodes the 
output from "python.exe" using its current output encoding and then re-encodes 
it using its current input encoding before writing to "more.com".
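
As a rough model of that middle-man step (a sketch, not PowerShell's actual 
implementation; the function name and the "replace" error handling are my 
assumptions), the relay amounts to a decode followed by a re-encode:

```python
def relay_through_shell(data: bytes, output_encoding: str, input_encoding: str) -> bytes:
    """Decode a child's output with one encoding, then re-encode it for the next child.

    Hypothetical helper: models a shell that sits between two pipes.
    """
    text = data.decode(output_encoding, errors="replace")
    return text.encode(input_encoding, errors="replace")


ansi_bytes = "\u00e9".encode("cp1252")  # 'é' as written by Python using ANSI 1252

# Shell decodes as 1252 and re-encodes as 437: the text survives, the bytes change.
transcoded = relay_through_shell(ansi_bytes, "cp1252", "cp437")  # b'\x82'

# Shell decodes and re-encodes with the same single-byte code page: bytes pass through.
passthrough = relay_through_shell(ansi_bytes, "cp437", "cp437")  # b'\xe9'

# Shell expects UTF-8 but receives ANSI bytes: the character is lost.
lost = relay_through_shell(ansi_bytes, "utf-8", "cp437")  # b'?'
```

So whether the text survives depends on whether the shell's output encoding 
matches what the first process actually wrote.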

> # If we adjust cmd CP, it's fine too:
> L:\Cop>chcp 1252
> Page de codes active : 1252
> L:\Cop>py testcp.py | more
> é

In this case, the ANSI code-page encoded output from Python is written to a 
pipe that's read directly by "more.com". In turn, "more.com" decodes the input 
bytes using the console input code page before writing UTF-16 text to the 
console via WriteConsoleW(). 
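
The effect of the decoding step can be approximated in Python. This is only a 
sketch of the byte-to-text step; the helper name is made up, and it ignores 
everything else "more.com" does:

```python
def decode_like_more(data: bytes, console_input_cp: int) -> str:
    """Decode piped bytes using the given console input code page (hypothetical helper)."""
    return data.decode(f"cp{console_input_cp}")


piped = "\u00e9".encode("cp1252")  # what Python writes to the pipe with ANSI 1252

after_chcp_1252 = decode_like_more(piped, 1252)  # 'é' (U+00E9), code pages agree
with_cp_437 = decode_like_more(piped, 437)       # 'Θ' (U+0398), code pages disagree
```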

To make Python use the console input code page for standard I/O, query the code 
page via "chcp.com", and set PYTHONIOENCODING. For example:

    C:\>chcp
    Active code page: 437
    C:\>set PYTHONIOENCODING=cp437
    C:\>py -c "print('é')" | more
    é

It would be convenient to support encodings that are based on the current 
console code pages, maybe named "conin" and "conout", based on GetConsoleCP() 
and GetConsoleOutputCP(). For example:

    C:\>set PYTHONIOENCODING=conin

They could default to the process active code page from GetACP() when there's 
no console. "ansi" and "oem" are already supported, so all four of the common 
encoding abstractions would be supported.
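
Until something like that exists, the lookup can be approximated with ctypes. 
This is a sketch of the idea only: the function name and the "conin"/"conout" 
spellings are hypothetical, and the non-Windows branch is just there so the 
code runs elsewhere:

```python
import ctypes
import locale
import sys


def console_code_page_encoding(which: str = "conin") -> str:
    """Return an encoding name for the console input or output code page.

    Falls back to the process active code page (GetACP) when there is no
    console. On non-Windows platforms, returns the locale encoding instead.
    """
    if which not in ("conin", "conout"):
        raise ValueError("which must be 'conin' or 'conout'")
    if sys.platform != "win32":
        return locale.getpreferredencoding(False)
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    if which == "conin":
        code_page = kernel32.GetConsoleCP()
    else:
        code_page = kernel32.GetConsoleOutputCP()
    if code_page == 0:  # both functions return 0 on failure, e.g. no console
        code_page = kernel32.GetACP()
    return f"cp{code_page}"
```

The result could then be passed to something like 
`sys.stdout.reconfigure(encoding=...)` early in a script.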

> when there's redirection or piping, encoding falls back to ANSI CP 
> (from config_get_locale_encoding).

The default encoding for files is locale.getpreferredencoding(), unless UTF-8 
mode is enabled. In Windows, this is the process active code page, as returned 
by WinAPI GetACP(). By default, this is the system ANSI code page.
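
A quick way to see this is to open a file without an explicit encoding and 
compare it with the locale's preferred encoding. This is a sketch with a 
hypothetical helper name; codec names are normalized because their spelling 
can differ:

```python
import codecs
import locale
import os
import tempfile


def default_text_encoding() -> str:
    """Return the encoding open() picks when none is given."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w") as f:  # deliberately no encoding= argument
            return f.encoding
    finally:
        os.remove(path)


file_codec = codecs.lookup(default_text_encoding()).name
locale_codec = codecs.lookup(locale.getpreferredencoding(False)).name
# The two normalized names are expected to match, with or without UTF-8 mode.
```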

Standard I/O is no exception, unless either PYTHONIOENCODING is set or the 
stream is a console device file. The default, non-legacy behavior for console 
files is to use UTF-8 at the buffer and raw I/O level. Internally, Python uses the 
wide-character console I/O functions ReadConsoleW() and WriteConsoleW(), with 
UTF-16 encoded text.
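
To see which case applies to a given stream, it's enough to look at its 
encoding and at the type of its raw layer. The helper below is a hypothetical 
sketch. For a Windows console, I'd expect the raw type to be 
"_WindowsConsoleIO" and the encoding to be "utf-8":

```python
import io


def describe_stream(stream) -> dict:
    """Summarize a text stream: its encoding, whether it's a tty, and its raw layer type."""
    buffer = getattr(stream, "buffer", None)
    raw = getattr(buffer, "raw", buffer)  # unbuffered objects have no .raw
    return {
        "encoding": stream.encoding,
        "isatty": stream.isatty(),
        "raw_type": type(raw).__name__,
    }


# Stand-in for a redirected stdout that uses the ANSI code page.
piped_like = io.TextIOWrapper(io.BytesIO(), encoding="cp1252")
info = describe_stream(piped_like)
# info == {'encoding': 'cp1252', 'isatty': False, 'raw_type': 'BytesIO'}
```

Calling `describe_stream(sys.stdout)` with and without a pipe shows the 
difference.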

Windows 10 allows setting the system ANSI code page to UTF-8. It also allows an 
application to override its active code page to UTF-8, but that's not easy to 
change. It requires adding an "activeCodePage" setting to the manifest that's 
embedded in the executable, which can be done using the manifest tool, "mt.exe".

----------
nosy: +eryksun

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue42707>
_______________________________________