Eryk Sun <[email protected]> added the comment:
> I understand Python should be reading the current CP (from
> GetConsoleOutputCP) or using the default OEM CP, and not assuming
> ANSI CP for stdio
A while ago I analyzed text encodings used by many of the legacy CLI programs
in Windows. Some programs hard code using either the ANSI or OEM code page, and
others use either the console's current input code page or its current output
code page. In light of the inconsistencies, I think defaulting to ANSI for
non-console standard I/O is fine.
> There's an IO codepage set on Windows consoles (`chcp` for cmd,
> `[Console]::InputEncoding; [Console]::OutputEncoding` for PowerShell ;
The CMD shell is a Unicode (UTF-16) application, i.e. it calls wide-character
system and console I/O functions such as ReadConsoleW() and WriteConsoleW(). It
still uses the console output code page, but as a kind of locale encoding. For
example, CMD uses the *output* code page when reading a batch file as well as
when reading output from an external command in a `FOR /F` loop. If Python were
only concerned with satisfying a `FOR /F` loop in CMD, then it would be
reasonable to make stdout default to the console output code page. But
"more.com" and "find.exe" are commonly used as well, and they decode piped
input using the console *input* code page. Other commands such as "findstr.exe"
use OEM.
PowerShell adds a spin to this problem. In CMD, piping bytes between two
processes doesn't actively involve the shell. It just creates an anonymous
pipe, with each process connected to either end. In contrast, PowerShell
injects itself as a middle man. For example, piping between "python.exe" and
"more.com" is implemented as a pipe from "python.exe" to PowerShell and a
separate pipe from PowerShell to "more.com". In between, PowerShell decodes the
output from "python.exe" using its current output encoding and then re-encodes
it using its current input encoding before writing to "more.com".
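The relay described above can be modeled in a few lines of Python. This is only a sketch of the decode/re-encode step, not PowerShell's actual implementation. The function name powershell_relay() and the choice of cp1252 as the ANSI code page are assumptions made for illustration.

```python
def powershell_relay(data: bytes, output_encoding: str, input_encoding: str) -> bytes:
    """Model the middle man: decode with the shell's output encoding,
    then re-encode with its input encoding (hypothetical helper)."""
    text = data.decode(output_encoding, errors="replace")
    return text.encode(input_encoding, errors="replace")


if __name__ == "__main__":
    # Python writes 'é' to the pipe using the ANSI code page (assumed cp1252).
    written = "é".encode("cp1252")
    print(written)  # b'\xe9'

    # The shell decodes with cp437, where byte 0xE9 is 'Θ', not 'é'.
    print(written.decode("cp437"))  # Θ

    # Re-encoding as UTF-8 changes the bytes that the next process receives.
    print(powershell_relay(written, "cp437", "utf-8"))  # b'\xce\x98'
```

If the two shell encodings and Python's encoding all agree, the bytes pass through unchanged. Any mismatch alters the text before the next program sees it.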
> # If we adjust cmd CP, it's fine too:
> L:\Cop>chcp 1252
> Page de codes active : 1252
> L:\Cop>py testcp.py | more
> é
In this case, the ANSI code-page encoded output from Python is written to a
pipe that's read directly by "more.com". In turn, "more.com" decodes the input
bytes using the console input code page before writing UTF-16 text to the
console via WriteConsoleW().
To make Python use the console input code page for standard I/O, query the code
page via "chcp.com", and set PYTHONIOENCODING. For example:
    C:\>chcp
    Active code page: 437
    C:\>set PYTHONIOENCODING=cp437
    C:\>py -c "print('é')" | more
    é
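The effect of PYTHONIOENCODING on the bytes written to a pipe can also be checked from Python itself. The following is a minimal sketch. The helper name run_with_io_encoding() is hypothetical, and it launches a child interpreter via sys.executable rather than the "py" launcher.

```python
import os
import subprocess
import sys


def run_with_io_encoding(code: str, encoding: str) -> bytes:
    """Run `code` in a child interpreter with PYTHONIOENCODING set and
    return the raw bytes it writes to its piped stdout."""
    env = dict(os.environ, PYTHONIOENCODING=encoding)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # The child prints 'é'; '\xe9' keeps the command line pure ASCII.
    code = "print('\\xe9')"
    print(run_with_io_encoding(code, "cp437").strip())   # b'\x82'
    print(run_with_io_encoding(code, "cp1252").strip())  # b'\xe9'
```

Byte 0x82 is 'é' in code page 437, which is what a reader that decodes with that code page expects.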
It would be convenient to support encodings that are based on the current
console code pages, maybe named "conin" and "conout", based on GetConsoleCP()
and GetConsoleOutputCP(). For example:
    C:\>set PYTHONIOENCODING=conin
They could default to the process active code page from GetACP() when there's
no console. "ansi" and "oem" are already supported, so all four of the common
encoding abstractions would be supported.
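Until something like that exists, the lookup can be approximated in user code. The sketch below resolves the proposed names with ctypes. The function console_encoding() is hypothetical, and the non-Windows fallback to the locale encoding is an assumption added only so the example runs elsewhere.

```python
import locale
import sys


def console_encoding(name: str) -> str:
    """Resolve the proposed names 'conin' / 'conout' to a codec name."""
    if name not in ("conin", "conout"):
        raise ValueError(f"unknown console encoding name: {name!r}")
    if sys.platform != "win32":
        # Assumption for this sketch: off Windows, use the locale encoding.
        return locale.getpreferredencoding(False)

    import ctypes

    kernel32 = ctypes.windll.kernel32
    if name == "conin":
        cp = kernel32.GetConsoleCP()
    else:
        cp = kernel32.GetConsoleOutputCP()
    if cp == 0:
        # Both functions return 0 on failure, e.g. when there's no console.
        cp = kernel32.GetACP()
    return f"cp{cp}"


if __name__ == "__main__":
    print(console_encoding("conin"), console_encoding("conout"))
```

A script could then call sys.stdout.reconfigure(encoding=console_encoding("conin")) early at startup, which has roughly the same effect on stdout as setting PYTHONIOENCODING.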
> when there's redirection or piping, encoding falls back to ANSI CP
> (from config_get_locale_encoding).
The default encoding for files is locale.getpreferredencoding(), unless UTF-8
mode is enabled. In Windows, this is the process active code page, as returned
by WinAPI GetACP(). By default, this is the system ANSI code page.
Standard I/O follows the same rule, unless either PYTHONIOENCODING is set or
it's a console device file. The default, non-legacy behavior for console files
is to use UTF-8 at the buffer and raw I/O level. Internally, Python uses the
wide-character console I/O functions ReadConsoleW() and WriteConsoleW(), with
UTF-16 encoded text.
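These defaults are easy to inspect at runtime. The sketch below reports the locale encoding, whether UTF-8 mode is enabled, and the encoding of each standard stream. The helper name describe_stdio() is hypothetical.

```python
import locale
import sys


def describe_stdio() -> dict:
    """Collect the encodings that determine how standard I/O is handled."""
    info = {
        "preferred": locale.getpreferredencoding(False),
        "utf8_mode": bool(sys.flags.utf8_mode),
    }
    for name in ("stdin", "stdout", "stderr"):
        stream = getattr(sys, name)
        if stream is None:
            # For example, pythonw.exe starts without standard streams.
            info[name] = None
            continue
        info[name] = {
            "encoding": getattr(stream, "encoding", None),
            "isatty": stream.isatty(),
        }
    return info


if __name__ == "__main__":
    for key, value in describe_stdio().items():
        print(key, value)
```

Running it once at a console and once with output piped to another program should show the stdout encoding switch between the console default and the locale encoding, provided PYTHONIOENCODING is unset and UTF-8 mode is off.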
Windows 10 allows setting the system ANSI code page to UTF-8. It also allows an
application to override its active code page to UTF-8, but that isn't easy to
set up. It requires adding an "activeCodePage" setting to the manifest that's
embedded in the executable, which can be done using the manifest tool, "mt.exe".
----------
nosy: +eryksun
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue42707>
_______________________________________