eryksun added the comment:
For what it's worth, the cp65001 codec was added in Python 3.3. In my
experience, codepage 65001 (CP_UTF8) is broken for most console programs.
For a console buffer handle, the Windows API function WriteFile gets routed to
WriteConsoleA, but WriteConsoleA has a different spec: it returns the number of
wide characters written instead of the number of bytes. WriteFile then returns
this number without adjusting for the fact that 1 character != 1 byte. For
example, the following writes 5 bytes (3 wide characters), but WriteFile
reports NumberOfBytesWritten as 3:
>>> import sys, msvcrt
>>> from ctypes import windll, c_uint, byref
>>> windll.kernel32.SetConsoleOutputCP(65001)
1
>>> h_out = msvcrt.get_osfhandle(sys.stdout.fileno())
>>> buf = '\u0100\u0101\n'.encode('utf-8')
>>> n = c_uint()
>>> windll.kernel32.WriteFile(h_out, buf, len(buf),
... byref(n), None)
Āā
1
>>> n.value
3
>>> len(buf)
5
There's a similar problem with ReadFile calling ReadConsoleA.
ANSICON (github.com/adoxa/ansicon) can hook WriteFile to fix this for select
programs. However, it doesn't hook ReadFile, so stdin.read remains broken.
> >>> import locale
> >>> locale.getpreferredencoding()
> 'cp1252'
The preferred encoding is based on the Windows locale codepage, which is
returned by kernel32!GetACP, i.e. the 'ANSI' codepage. If you want the console
codepages that were set at program startup, look at sys.stdin.encoding and
sys.stdout.encoding:
>>> windll.kernel32.SetConsoleCP(1252)
1
>>> windll.kernel32.SetConsoleOutputCP(65001)
1
>>> script = r'''
... import sys
... print(sys.stdin.encoding, sys.stdout.encoding)
... '''
>>> import subprocess
>>> subprocess.call('py -3 -c "%s"' % script)
cp1252 cp65001
0
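To compare the three values discussed above side by side, something like the
following could be used. It is a sketch rather than a definitive recipe; the
function name report_encodings is made up for illustration, and getattr is used
only as a guard in case a standard stream is missing:

```python
import locale
import sys

def report_encodings():
    """Collect the encodings Python reports for the locale and std streams."""
    return {
        # False: query the preferred encoding without calling setlocale().
        'preferred': locale.getpreferredencoding(False),
        # A stream may be None or lack .encoding (e.g. when replaced).
        'stdin': getattr(sys.stdin, 'encoding', None),
        'stdout': getattr(sys.stdout, 'encoding', None),
    }

for name, enc in sorted(report_encodings().items()):
    print(name, enc)
```

When the streams are redirected to a file or pipe, the stream encodings
reported here need not match the console codepages.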
> >>> locale.getlocale()
> (None, None)
> >>> locale.getlocale(locale.LC_ALL)
> (None, None)
On most POSIX platforms nowadays, Py_Initialize sets the LC_CTYPE category to
its default value by calling setlocale(LC_CTYPE, "") in order to "obtain the
locale's charset without having to switch locales". On the other hand, the
bootstrapping process for Windows doesn't use the C runtime locale, so at
startup LC_CTYPE is still in the default "C" locale:
>>> locale.setlocale(locale.LC_CTYPE, None)
'C'
This in turn gets parsed into the (None, None) tuple that getlocale() returns:
>>> locale._parse_localename('C')
(None, None)
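The same result can be seen through the public API only, without calling the
private _parse_localename. A small sketch, assuming it is acceptable to switch
LC_CTYPE to "C" temporarily; getlocale_under_c is a hypothetical helper name:

```python
import locale

def getlocale_under_c():
    """Return what getlocale() reports while LC_CTYPE is the "C" locale."""
    # Passing None only queries the current setting; nothing is changed.
    saved = locale.setlocale(locale.LC_CTYPE, None)
    try:
        locale.setlocale(locale.LC_CTYPE, 'C')
        return locale.getlocale(locale.LC_CTYPE)
    finally:
        # Put back whatever LC_CTYPE was before the call.
        locale.setlocale(locale.LC_CTYPE, saved)

print(getlocale_under_c())
```

Note that setlocale affects the whole process, so a helper like this is not
safe to call while other threads depend on the locale.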
----------
nosy: +eryksun
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue21808>
_______________________________________