Eryk Sun added the comment:

>> so ANSI is the natural default for a detached process
>
> To clarify - ANSI is the natural default *for programs that 
> don't support Unicode*.

By natural, I meant in the context of using GetConsoleOutputCP(), since 
WideCharToMultiByte(0, ...) encodes text as ANSI. Clearly UTF-16LE is preferred 
for IPC on Windows. It's the native Unicode format down to the lowest levels of 
the kernel. But we're talking about old-school IPC using standard I/O 
pipelines, for which I think UTF-8 is a better fit.

> Forcing the use of UTF-8 as the code page is the easiest way 
> for us to support it.

The console's behavior for codepage 65001 is too buggy. The show stopper is 
that it limits input to ASCII. The console allocates a temporary buffer for the 
encoded text that's sized assuming 1 ANSI/OEM byte per UTF-16 code. So if you 
enter non-ASCII characters, WideCharToMultiByte fails in conhost.exe. But the 
console returns that the operation has successfully read 0 bytes. Python's REPL 
and input() see this as EOF.

For example:

    import sys, ctypes, msvcrt
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

    conin = open(r'\\.\CONIN$', 'r+')
    h = msvcrt.get_osfhandle(conin.fileno())
    buf = (ctypes.c_char * 15)()
    n = (ctypes.c_ulong * 1)()

    >>> sys.stdin.encoding
    'cp65001'

ReadFile test in Windows 10:

    >>> kernel32.ReadFile(h, buf, 15, n, None)
    Test!
    1
    >>> n[0], buf[:]
    (7, b'Test!\r\n\x00\x00\x00\x00\x00\x00\x00\x00')

    >>> kernel32.ReadFile(h, buf, 15, n, None)
    ¡Prueba!
    1
    >>> n[0], buf[:]
    (0, b'Test!\r\n\x00\x00\x00\x00\x00\x00\x00\x00')

The second call obviously fails, even though it returns 1. The input contains 
non-ASCII "¡", which in UTF-8 requires 2 bytes, b'\xc2\xa1'. This causes the 
failure in conhost.exe that I described above.
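
To make the size mismatch concrete, here is a small sketch in plain Python (no
console involved) that compares the number of UTF-16 code units with the number
of UTF-8 bytes for the two ReadFile inputs above. The helper name utf16_units
is made up for illustration; it is not part of any API.

```python
def utf16_units(text):
    """Return the number of UTF-16 code units needed for text."""
    return len(text.encode('utf-16-le')) // 2

for line in ('Test!\r\n', '¡Prueba!\r\n'):
    units = utf16_units(line)
    nbytes = len(line.encode('utf-8'))
    # A buffer sized at 1 byte per UTF-16 code unit is too small
    # whenever nbytes > units, i.e. for any non-ASCII input.
    print(repr(line), units, nbytes, nbytes > units)
```

The ASCII line needs 7 bytes for 7 code units, while the second line needs 11
bytes for 10 code units, so it no longer fits a buffer sized by code units.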

ReadConsoleA has the same problem:

    >>> kernel32.ReadConsoleA(h, buf, 15, n, None)
    Hello World!
    1
    >>> n[0], buf[:]
    (14, b'Hello World!\r\n\x00')

    >>> kernel32.ReadConsoleA(h, buf, 15, n, None)
    ¡Hola Mundo!
    1
    >>> n[0], buf[:]
    (0, b'Hello World!\r\n\x00')

UTF-8 output is also buggy prior to Windows 8. The problem is that WriteFile 
returns the number of UTF-16 codes written instead of the number of bytes. For 
non-ASCII characters in the BMP, 1 UTF-16 code is 2 or 3 UTF-8 bytes. So it 
looks like a partial write. A buffered writer will loop multiple times to write 
what appears to be the remaining bytes, in a trail of junk lines in proportion 
to the number of non-ASCII characters written.
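
The effect on a buffered writer can be simulated without a console. In the
sketch below, buggy_write is a made-up stand-in for the pre-Windows 8 behavior
(it writes everything but reports UTF-16 code units), and write_all is a
simplified retry loop, not Python's actual buffered writer.

```python
def buggy_write(data, console):
    """Simulated WriteFile: writes all of data to the console, but
    reports UTF-16 code units written instead of bytes."""
    text = data.decode('utf-8', errors='replace')
    console.append(text)
    return len(text.encode('utf-16-le')) // 2

def write_all(data, write, console):
    """Simplified buffered-writer loop: keep writing the bytes that
    the previous call reported as unwritten."""
    remaining = data
    while remaining:
        n = write(remaining, console)
        remaining = remaining[n:]

console = []
write_all('¡Hola!\n'.encode('utf-8'), buggy_write, console)
print(console)
```

Here the first call reports 7 of 8 bytes, so the loop writes the trailing
newline a second time and the simulated console ends up with an extra line.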

Python could work around this by decoding the buffer to get the corresponding 
number of UTF-16 codes written in the console, but child processes may also be 
subject to this bug. The only general solution on Windows 7 is to use something 
like ANSICON, which uses DLL injection to hook and wrap WriteFile and 
WriteConsoleA.
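
As a rough sketch of the decoding workaround, assuming the buffer holds
complete, valid UTF-8, the reported code-unit count could be mapped back to a
byte count like this. The function name bytes_consumed is hypothetical, and
this is not an existing CPython implementation.

```python
def bytes_consumed(data, units_written):
    """Map a count of UTF-16 code units written back to the number
    of UTF-8 bytes of data that those code units correspond to."""
    text = data.decode('utf-8')
    # Go through UTF-16LE so that non-BMP characters count as 2 units.
    utf16 = text.encode('utf-16-le')
    # Drop a trailing half of a surrogate pair, if the count splits one.
    prefix = utf16[:2 * units_written].decode('utf-16-le', errors='ignore')
    return len(prefix.encode('utf-8'))

data = '¡Hola!\n'.encode('utf-8')
print(bytes_consumed(data, 7), len(data))
```

With this mapping, a return value of 7 code units for the 8-byte buffer is
recognized as a complete write rather than a partial one.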

There's also a UTF-8 related bug in ulib.dll. This bug affects programs that do 
console codepage conversions, such as more.com. This in turn affects Python's 
interactive help(). I looked at this in issue 19914. The ulib bug is fixed in 
Windows 10. I don't know whether it's fixed in Windows 8, but it's there in 
Windows 7 (supported until 2020).

> This would make Python's implementation much more 
> complicated, as well as breaking some scripts and 
> existing packages.

Unless you're talking about major breakage, I think switching to the 
wide-character API is worth it, as the only viable path to supporting Unicode 
in the console. The implementation probably should transcode between UTF-16LE 
and UTF-8, so pure Python never sees UTF-16 byte strings. sys.std*.encoding 
would be 'utf-8'. os.read and os.write would be implemented as _Py_read and 
_Py_write (which already exist). For console handles these could delegate to 
_Py_console_read and _Py_console_write, to convert between UTF-8 and UTF-16LE 
and call ReadConsoleW and WriteConsoleW.
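
A minimal ctypes sketch of the write side, to show the shape of the idea. The
names utf8_to_wide and console_write are hypothetical stand-ins for the
proposed C-level _Py_console_write. The sketch assumes the buffer is complete
UTF-8 and ignores partial writes.

```python
import sys

def utf8_to_wide(data):
    """Decode UTF-8 bytes; return (text, number of UTF-16 code units)."""
    text = data.decode('utf-8')
    return text, len(text.encode('utf-16-le')) // 2

if sys.platform == 'win32':
    import ctypes
    import msvcrt
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    kernel32.WriteConsoleW.argtypes = (
        wintypes.HANDLE, wintypes.LPCWSTR, wintypes.DWORD,
        wintypes.LPDWORD, wintypes.LPVOID)
    kernel32.WriteConsoleW.restype = wintypes.BOOL

    def console_write(fd, data):
        """Write UTF-8 bytes to a console file descriptor via
        WriteConsoleW and report the result in bytes."""
        text, nunits = utf8_to_wide(data)
        h = msvcrt.get_osfhandle(fd)
        written = wintypes.DWORD()
        if not kernel32.WriteConsoleW(h, text, nunits,
                                      ctypes.byref(written), None):
            raise ctypes.WinError(ctypes.get_last_error())
        # Assumption: the whole buffer was written, so the caller
        # sees a byte count, never a UTF-16 code-unit count.
        return len(data)
```

Callers would keep passing and receiving UTF-8 byte counts, which is what lets
sys.std*.encoding be 'utf-8' while the console itself only sees UTF-16.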

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue27179>
_______________________________________