Eryk Sun added the comment:
This is a third-party problem due to bugs in the console's support for codepage
65001. For the general problem of Unicode in the console, see issue 1602. The
best way to resolve this problem is by using the wide-character APIs,
WriteConsoleW and ReadConsoleW. I suggest that you try the win_unicode_console
package.
> But if I try to print something a little less common
> (GREEK CAPITAL LETTER ALPHA), something weird happens:
>
> >python -c "print(chr(0x391))"
> Α
>
>
> >
In versions of Windows that use the legacy console, WriteFile to a console
screen mistakenly returns the number of UTF-16 codes written instead of the
number of bytes written.
For example, '\u0391\r\n' gets encoded as a four-byte buffer, b'\xce\x91\r\n'.
Here's the result of writing this buffer to the legacy console, using codepage
65001:
>>> sys.stdout.buffer.raw.write(b'\xce\x91\r\n')
Α
3
Four bytes were written, but the console returns that it wrote three UTF-16
codes. Python's BufferedWriter (i.e. sys.stdout.buffer) sees this as an
incomplete write. So it writes the last byte again. That's why you see an extra
newline. The problem can be far worse if the UTF-8 buffer contains many
non-ASCII characters, especially if it includes codes greater than U+07FF that
get encoded as three bytes.
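The duplicated newline can be reproduced without a legacy console. The
following is only a rough model, assuming a made-up raw stream class
(LegacyConsoleSim, not a real API) whose write() reports UTF-16 codes instead
of bytes, wrapped in a real io.BufferedWriter:

```python
import io


class LegacyConsoleSim(io.RawIOBase):
    """Hypothetical stand-in for the legacy console under codepage 65001."""

    def __init__(self):
        super().__init__()
        self.received = bytearray()

    def writable(self):
        return True

    def write(self, data):
        data = bytes(data)
        # The console displays every byte it is handed...
        self.received += data
        # ...but (the bug) reports the number of UTF-16 codes, not bytes.
        text = data.decode('utf-8', errors='replace')
        return len(text.encode('utf-16-le')) // 2


raw = LegacyConsoleSim()
writer = io.BufferedWriter(raw)
writer.write('\u0391\r\n'.encode('utf-8'))  # b'\xce\x91\r\n', four bytes
writer.flush()

# BufferedWriter saw "3 of 4 bytes written" and sent the last byte again.
print(bytes(raw.received))  # b'\xce\x91\r\n\n'
```

With more non-ASCII text in the buffer, the reported count falls further
behind the byte count, so a longer tail of the buffer gets written again.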
This particular problem is fixed in the new version of the console that comes
with Windows 10. For the legacy console, you can work around the problem by
hooking WriteConsoleA and WriteFile via DLL injection. For example, ANSICON and
ConEmu do this.
That said, there's a far worse problem with using codepage 65001 in the
console, which still exists in Windows 10. Due to this bug, Python's
interactive REPL will quit whenever you try to enter non-ASCII characters, and
built-in input() will raise EOFError. For example:
>>> input()
Ü
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
EOFError
To read the console's wide-character (UTF-16) input buffer via ReadFile, it has
to first get encoded to the current codepage. The console does the conversion
via WideCharToMultiByte with a buffer size that assumes each UTF-16 value will
be encoded as a single byte. But that's wrong for UTF-8, in which one UTF-16
code can map to as many as three bytes. So WideCharToMultiByte fails, but does
the console try to increase the buffer size? No. Does it fail the call? No. It
actually returns back that it 'successfully' read 0 bytes. To the REPL and
built-in input() that signals EOF (end of file).
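As a rough illustration of that failure mode, here is a pure-Python model. The
function name simulated_console_read is made up for this sketch, and it only
mimics the buffer-size assumption described above; it is not the console's
actual code:

```python
def simulated_console_read(text, codepage):
    """Model a ReadFile from the console input buffer.

    Assumes the conversion buffer holds one byte per UTF-16 code, and that a
    failed conversion is reported as a 'successful' read of 0 bytes.
    """
    buffer_size = len(text.encode('utf-16-le')) // 2  # one byte per code
    encoded = text.encode(codepage)
    if len(encoded) > buffer_size:
        return b''  # conversion failed; the caller sees this as EOF
    return encoded


# A single-byte codepage fits the buffer, so the line is returned.
print(simulated_console_read('Ü\r\n', 'cp1252'))  # b'\xdc\r\n'

# UTF-8 needs two bytes for 'Ü', so the read comes back empty.
print(simulated_console_read('Ü\r\n', 'utf-8'))   # b''

# Pure ASCII input still works under UTF-8.
print(simulated_console_read('abc\r\n', 'utf-8'))  # b'abc\r\n'
```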
If you only need to input text in your system locale, you can try to have the
best of both worlds. Use chcp.com to set the command prompt to the codepage you
need for input. Then in your Python script (e.g. in sitecustomize.py) you can
use ctypes to change just the output codepage and rebind sys.stdout. For
example:
>>> import os, sys, ctypes
>>> ctypes.WinDLL('kernel32').SetConsoleOutputCP(65001)
1
>>> sys.stdout = open(os.dup(sys.__stdout__.fileno()), 'w',
...                   encoding='cp65001')
>>> sys.stdin.encoding
'cp1252'
>>> input()
Ü
'Ü'
>>> print('\u0391')
Α
Another minor bug is that the console doesn't keep an overlapping window in
case a UTF-8 sequence gets split across multiple writes (typically due to
buffering). For example:
>>> exec(r'''
... sys.stdout.buffer.raw.write(b'\xce')
... sys.stdout.buffer.raw.write(b'\x91')
... ''')
��>>>
Since UTF-8 uses up to four bytes per code, the console would have to keep a
three-byte buffer to handle the case of a split write.
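Python's own incremental decoders show what keeping such a buffer looks like.
This is only an analogy for what the console would need to do, not a
description of its internals:

```python
import codecs

# GREEK CAPITAL LETTER ALPHA, split across two writes.
chunks = [b'\xce', b'\x91']

# Decoding each write on its own, as the console does, loses the character.
independent = ''.join(c.decode('utf-8', errors='replace') for c in chunks)
print(ascii(independent))  # '\ufffd\ufffd'

# An incremental decoder holds back the incomplete lead byte until the
# continuation byte arrives in the next chunk.
decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')
buffered = ''.join(decoder.decode(c) for c in chunks)
print(ascii(buffered))  # '\u0391'
```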
> Look, guys, I know what a mess Unicode handling on Windows is,
> and I'm not even sure it's Python's fault
Unicode handling is only a mess in the Windows API if you think Unicode is
synonymous with UTF-8. Windows NT is Unicode down to the lowest levels of the
kernel, but it's UTF-16 using 16-bit wide characters. Part of the problem is
that the C and POSIX APIs that are preferred by cross-platform applications are
byte oriented (e.g. null-terminated char strings), so Unicode support becomes
synonymous with UTF-8. On Windows this leaves you stuck using the ANSI
codepage, which unfortunately cannot be set to codepage 65001. Microsoft would
have to rewrite a lot of code to support UTF-8 in the ANSI API, and they have
no incentive to pay for that given that they're heavily invested in UTF-16.
----------
nosy: +eryksun
resolution: -> third party
stage: -> resolved
status: open -> closed
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue26345>
_______________________________________