[issue26345] Extra newline appended to UTF-8 strings on Windows

Eryk Sun Fri, 12 Feb 2016 02:42:38 -0800

Eryk Sun added the comment:

This is a third-party problem due to bugs in the console's support for codepage 
65001. For the general problem of Unicode in the console, see issue 1602. The 
best way to resolve this problem is by using the wide-character APIs, 
WriteConsoleW and ReadConsoleW. I suggest that you try the win_unicode_console 
package.


> But if I try to print something a little less common
> (GREEK CAPITAL LETTER ALPHA), something weird happens:
>
>    >python -c "print(chr(0x391))"
>    Α
>
>
>    >

In versions of Windows that use the legacy console, WriteFile to a console 
screen mistakenly returns the number of UTF-16 codes written instead of the 
number of bytes written. 

For example, '\u0391\r\n' gets encoded as a four-byte buffer, b'\xce\x91\r\n'. 
Here's the result of writing this buffer to the legacy console, using codepage 
65001:

    >>> sys.stdout.buffer.raw.write(b'\xce\x91\r\n')
    Α
    3

Four bytes were written, but the console returns that it wrote three UTF-16 
codes. Python's BufferedWriter (i.e. sys.stdout.buffer) sees this as an 
incomplete write. So it writes the last byte again. That's why you see an extra 
newline. The problem can be far worse if the UTF-8 buffer contains many 
non-ASCII characters, especially if it includes codes greater than U+07FF that 
get encoded as three bytes. 
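
The retry behavior can be reproduced without a Windows console. The following is only a sketch: LegacyConsoleSim is a hypothetical stand-in (not part of Python or Windows) whose write() reports the number of UTF-16 codes instead of the number of bytes, which is the behavior described above.

```python
import io


class LegacyConsoleSim(io.RawIOBase):
    """Hypothetical raw stream that miscounts like the legacy console."""

    def __init__(self):
        super().__init__()
        self.chunks = []  # every buffer handed to write(), in order

    def writable(self):
        return True

    def write(self, b):
        data = bytes(b)
        self.chunks.append(data)
        # Report UTF-16 codes written rather than bytes written.
        text = data.decode('utf-8', errors='replace')
        return len(text.encode('utf-16-le')) // 2


raw = LegacyConsoleSim()
buffered = io.BufferedWriter(raw)
buffered.write(b'\xce\x91\r\n')
buffered.flush()

# The second chunk is the "unwritten" last byte being sent again.
print(raw.chunks)  # [b'\xce\x91\r\n', b'\n']
```

BufferedWriter treats the return value 3 as a partial write of a four-byte buffer, so it sends the remaining b'\n' a second time.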

This particular problem is fixed in the new version of the console that comes 
with Windows 10. For the legacy console, you can work around the problem by 
hooking WriteConsoleA and WriteFile via DLL injection. For example, ANSICON and 
ConEmu do this.

That said, there's a far worse problem with using codepage 65001 in the 
console, which still exists in Windows 10. Due to this bug Python's interactive 
REPL will quit whenever you try to enter non-ASCII characters, and built-in 
input() will raise EOFError. For example:

    >>> input()
    Ü
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    EOFError

To read the console's wide-character (UTF-16) input buffer via ReadFile, it has 
to first get encoded to the current codepage. The console does the conversion 
via WideCharToMultiByte with a buffer size that assumes each UTF-16 value will 
be encoded as a single byte. But that's wrong for UTF-8, in which one UTF-16 
code can map to as many as three bytes. So WideCharToMultiByte fails, but does 
the console try to increase the buffer size? No. Does it fail the call? No. It 
actually reports that it 'successfully' read 0 bytes. To the REPL and 
built-in input() that signals EOF (end of file).
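
As a rough model of that sizing mistake (an assumption-laden sketch, not the console's actual code), the hypothetical function below allots one byte per UTF-16 code and returns an empty read when the encoded text does not fit:

```python
def simulated_console_read(text, encoding):
    """Hypothetical model of the console encoding its input buffer.

    The output buffer is sized as if every UTF-16 code becomes one byte.
    If the encoded text needs more room, the read "succeeds" with 0 bytes.
    """
    utf16_codes = len(text.encode('utf-16-le')) // 2
    encoded = text.encode(encoding)
    if len(encoded) > utf16_codes:
        return b''  # looks like EOF to the caller
    return encoded


line = '\u00dc\r\n'  # U+00DC is LATIN CAPITAL LETTER U WITH DIAERESIS

print(simulated_console_read(line, 'utf-8'))       # b''
print(simulated_console_read(line, 'cp1252'))      # b'\xdc\r\n'
print(simulated_console_read('abc\r\n', 'utf-8'))  # b'abc\r\n'
```

ASCII-only input fits because each character is one byte in UTF-8, which is why the bug only shows up once a non-ASCII character is entered.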

If you only need to input text in your system locale, you can try to have the 
best of both worlds. Use chcp.com to set the command prompt to the codepage you 
need for input. Then in your Python script (e.g. in sitecustomize.py) you can 
use ctypes to change just the output codepage and rebind sys.stdout. For 
example:

    >>> import os, sys, ctypes
    >>> ctypes.WinDLL('kernel32').SetConsoleOutputCP(65001)
    1
    >>> sys.stdout = open(os.dup(sys.__stdout__.fileno()), 'w', encoding='cp65001')

    >>> sys.stdin.encoding
    'cp1252'
    >>> input()
    Ü
    'Ü'
    >>> print('\u0391')
    Α

Another minor bug is that the console doesn't keep an overlapping window in 
case a UTF-8 sequence gets split across multiple writes (typically due to 
buffering). For example:

    >>> exec(r'''
    ... sys.stdout.buffer.raw.write(b'\xce')
    ... sys.stdout.buffer.raw.write(b'\x91')
    ... ''')
    ��>>>

Since UTF-8 uses up to four bytes per code, the console would have to keep a 
three-byte buffer to handle the case of a split write.
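
Python's incremental decoders keep exactly this kind of state, so they make a convenient illustration of the difference. This sketch compares decoding each write on its own (roughly what the console does, as far as the output above suggests) with a decoder that carries the pending bytes over to the next write:

```python
import codecs

chunks = [b'\xce', b'\x91']  # U+0391 split across two writes

# Stateless: each write is decoded on its own, so both halves are invalid.
stateless = ''.join(c.decode('utf-8', errors='replace') for c in chunks)

# Stateful: the decoder holds the incomplete sequence until it is completed.
decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')
stateful = ''.join(decoder.decode(c) for c in chunks)

print(ascii(stateless))  # '\ufffd\ufffd'
print(ascii(stateful))   # '\u0391'
```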

> Look, guys, I know what a mess Unicode handling on Windows is,
> and I'm not even sure it's Python's fault 

Unicode handling is only a mess in the Windows API if you think Unicode is 
synonymous with UTF-8. Windows NT is Unicode down to the lowest levels of the 
kernel, but it's UTF-16 using 16-bit wide characters. Part of the problem is 
that the C and POSIX APIs that are preferred by cross-platform applications are 
byte oriented (e.g. null-terminated char strings), so Unicode support becomes 
synonymous with UTF-8. On Windows this leaves you stuck using the ANSI 
codepage, which unfortunately cannot be set to codepage 65001. Microsoft would 
have to rewrite a lot of code to support UTF-8 in the ANSI API, and they have 
no incentive to pay for that given that they're heavily invested in UTF-16.

----------
nosy: +eryksun
resolution:  -> third party
stage:  -> resolved
status: open -> closed

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26345>
_______________________________________