New submission from Alexandre <alexandre+pyt...@13x.fr>: Hello,
first of all, I hope this was not already discussed (I searched the bugs, but it might have been discussed elsewhere) and that it really is a bug.

I struggled today to understand why a simple file redirection didn't work properly (encoding issues), and I think I finally understand the whole thing.

There's an IO codepage set on Windows consoles (`chcp` for cmd, `[Console]::InputEncoding; [Console]::OutputEncoding` for PowerShell; `chcp` does not work in PowerShell even though it reports that it set the CP), 850 for my locale. When there's no redirection / piping, PyWindowsConsoleIO takes care of the encoding (utf-8, it seems), but when there's redirection or piping, the encoding falls back to the ANSI CP (from config_get_locale_encoding).

This behavior seems to be incorrect / breaking things. An example:

* testcp.py (file encoded as utf-8)
```
#!/usr/bin/env python3
# -*- coding: utf-8
print('é')
```

* using cmd:
```
# Test condition
L:\Cop>chcp
Page de codes active : 850

# We're fine here
L:\Cop>py testcp.py
é

L:\Cop>py -c "import sys; print(sys.stdout.encoding)"
utf-8

# Now with piping
L:\Cop>py -c "import sys; print(sys.stdout.encoding)" | more
cp1252

L:\Cop>py testcp.py | more
Ú

L:\Cop>py testcp.py > lol && type lol
Ú

# If we adjust cmd CP, it's fine too:
L:\Cop>chcp 1252
Page de codes active : 1252

L:\Cop>py testcp.py | more
é
```

* with pwsh:
```
PS L:\Cop> ([Console]::InputEncoding, [Console]::OutputEncoding) | select CodePage

CodePage
--------
     850
     850

# Fine without redirection
PS L:\Cop> py .\testcp.py
é

# Here, write-host expect cp850
PS L:\Cop> py .\testcp.py | write-output
Ú

# Same with Out-file (used by ">")
PS L:\Cop> py .\testcp.py > lol; Get-Content lol
Ú

#
PS L:\Cop> py .\testcp.py | more
├Ü
```

By reading some sources today to solve my issue, I found many solutions:
* in PS `[Console]::OutputEncoding = [Text.Utf8Encoding]::new($false); $env:PYTHONIOENCODING="utf8"` or `[Console]::OutputEncoding = [Text.Encoding]::GetEncoding(1252)`
* in CMD `chcp 65001 && set PYTHONIOENCODING=utf8` (but this seems to break more) or `chcp 1252`

But reading (and trusting) https://serverfault.com/questions/80635/how-can-i-manually-determine-the-codepage-and-locale-of-the-current-os (https://docs.microsoft.com/en-us/windows/win32/intl/locale-idefault-constants), I understand Python should be reading the current CP (from GetConsoleOutputCP, like https://github.com/python/cpython/blob/3.9/Python/fileutils.c:) or using the default OEM CP, and not assuming the ANSI CP for stdio:
> * the OEM code page for use by legacy console applications,
> * the ANSI code page for use by legacy GUI applications.

The init path I could trace:
> https://github.com/python/cpython/blob/3.9/Python/pylifecycle.c
> init_sys_streams
>> create_stdio
>> (https://github.com/python/cpython/blob/3.9/Python/pylifecycle.c#L1774)
>>> open.raw :
>>> https://github.com/python/cpython/blob/3.9/Modules/_io/_iomodule.c#L374
>>>> https://github.com/python/cpython/blob/3.9/Modules/_io/winconsoleio.c
>> fallback to init_sys_streams encoding
> https://github.com/python/cpython/blob/3.9/Python/initconfig.c
> config_init_stdio_encoding
> config_get_locale_encoding
> GetACP()

Some tests with GetConsoleCP:
```
L:\Cop>py -c "import os; print(os.device_encoding(0), os.device_encoding(1))" | more
cp850 None

L:\Cop>type nul | py -c "import os; print(os.device_encoding(0), os.device_encoding(1))"
None cp850

L:\Cop>type nul | py -c "import ctypes; print(ctypes.windll.kernel32.GetConsoleCP(), ctypes.windll.kernel32.GetConsoleOutputCP())"
850 850

L:\Cop>py -c "import ctypes; print(ctypes.windll.kernel32.GetConsoleCP(), ctypes.windll.kernel32.GetConsoleOutputCP())" | more
850 850
```

Some links / documentation, if useful:
* https://serverfault.com/questions/80635/how-can-i-manually-determine-the-codepage-and-locale-of-the-current-os
* https://docs.microsoft.com/en-us/windows/win32/intl/locale-idefault-constants
* https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getoemcp
* https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp
* https://docs.microsoft.com/en-us/windows/console/getconsoleoutputcp
* https://stackoverflow.com/questions/56944301/why-does-powershell-redirection-change-the-formatting-of-the-text-content
* https://stackoverflow.com/questions/19122755/output-echo-a-variable-to-a-text-file
* https://stackoverflow.com/questions/40098771/changing-powershells-default-output-encoding-to-utf-8
* Maybe related: https://github.com/PowerShell/PowerShell/issues/10907
* https://stackoverflow.com/questions/57131654/using-utf-8-encoding-chcp-65001-in-command-prompt-windows-powershell-window (will probably break things :) )
* https://stackoverflow.com/questions/49476326/displaying-unicode-in-powershell/49481797#49481797
* https://stackoverflow.com/questions/25642746/how-do-i-pipe-unicode-into-a-native-application-in-powershell

Please note I took the time to write this issue as best I could. I hope it won't be closed without an explanation of why the current behavior is normal (not that I suppose this will happen, I just don't know how people react here :) ).

Thanks a lot for Python, I really enjoy using it,

Best,
Alexandre

----------
components: Windows
messages: 383550
nosy: paul.moore, steve.dower, tim.golden, u36959, zach.ware
priority: normal
severity: normal
status: open
title: Python uses ANSI CP for stdio on Windows console instead of using console or OEM CP
type: behavior

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue42707>
_______________________________________