[issue21808] 65001 code page not supported

eryksun Thu, 19 Jun 2014 06:06:59 -0700

eryksun added the comment:

For what it's worth, the cp65001 codec was added in Python 3.3. In my 
experience, though, codepage 65001 (CP_UTF8) is broken for most console 
programs.

For a console buffer handle, the Windows API function WriteFile gets routed to 
WriteConsoleA, which has a different spec: it returns the number of wide 
characters written instead of the number of bytes. WriteFile then returns this 
number without adjusting for the fact that 1 character != 1 byte. For example, 
the following writes 5 bytes (3 wide characters), but WriteFile reports a 
NumberOfBytesWritten of 3:

    >>> import sys, msvcrt 
    >>> from ctypes import windll, c_uint, byref

    >>> windll.kernel32.SetConsoleOutputCP(65001)
    1

    >>> h_out = msvcrt.get_osfhandle(sys.stdout.fileno())
    >>> buf = '\u0100\u0101\n'.encode('utf-8')
    >>> n = c_uint()
    >>> windll.kernel32.WriteFile(h_out, buf, len(buf),                
    ...                           byref(n), None)
    Āā
    1

    >>> n.value
    3
    >>> len(buf)
    5

There's a similar problem with ReadFile calling ReadConsoleA.
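
To make the mismatch concrete without a Windows console, here is a minimal 
sketch in plain Python. Both function names are mine, and fake_write is a 
hypothetical stand-in for the misreported WriteFile count, not the real API. 
How the console counts an invalid byte is also an assumption (one character 
per undecodable byte).

```python
def utf8_vs_utf16_counts(text):
    """Return (UTF-8 byte count, UTF-16 code unit count) for text."""
    n_bytes = len(text.encode('utf-8'))
    # 'utf-16-le' emits no BOM, so every 2 bytes is one wide character.
    n_wide = len(text.encode('utf-16-le')) // 2
    return n_bytes, n_wide


def simulate_retry_loop(buf):
    """Model a caller that retries until 'all bytes' are reported written.

    fake_write is hypothetical: it consumes the whole buffer but reports
    a character count, as WriteFile does in the session above.
    """
    chunks = []

    def fake_write(data):
        chunks.append(data)
        # Assumption: undecodable bytes are counted as one character each.
        return len(data.decode('utf-8', errors='replace'))

    written = 0
    while written < len(buf):
        written += fake_write(buf[written:])
    return chunks
```

For '\u0100\u0101\n', utf8_vs_utf16_counts returns (5, 3). Feeding the encoded 
5 bytes to simulate_retry_loop yields two chunks: the full buffer, then the 
last 2 bytes again, because the loop believes only 3 bytes went out. A caller 
that trusts the count and retries would resend the tail of its output in this 
way.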

ANSICON (github.com/adoxa/ansicon) can hook WriteFile to fix this for select 
programs. However, it doesn't hook ReadFile, so stdin.read remains broken. 

>    >>> import locale
>    >>> locale.getpreferredencoding()
>    'cp1252'

The preferred encoding is based on the Windows locale codepage, which is 
returned by kernel32!GetACP, i.e. the 'ANSI' codepage. If you want the console 
codepages that were set at program startup, look at sys.stdin.encoding and 
sys.stdout.encoding:

    >>> windll.kernel32.SetConsoleCP(1252)       
    1
    >>> windll.kernel32.SetConsoleOutputCP(65001)
    1
    >>> script = r'''
    ... import sys
    ... print(sys.stdin.encoding, sys.stdout.encoding)
    ... '''

    >>> import subprocess
    >>> subprocess.call('py -3 -c "%s"' % script)
    cp1252 cp65001
    0
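
If you just want to see the values side by side in the current process, 
something like the following should do. It is a small sketch; the function 
name and dictionary keys are my own, and the getattr guards are there only 
because a stream may be missing or replaced by an object with no encoding 
attribute.

```python
import locale
import sys


def encoding_report():
    """Collect the locale's preferred encoding and the stdio encodings."""
    return {
        # False: query only, do not call setlocale as a side effect.
        'preferred': locale.getpreferredencoding(False),
        # The stdio encodings are fixed when the streams are created.
        'stdin': getattr(sys.stdin, 'encoding', None),
        'stdout': getattr(sys.stdout, 'encoding', None),
    }


if __name__ == '__main__':
    for name, value in encoding_report().items():
        print(name, value)
```

In the console configured above I'd expect 'preferred' to differ from the two 
stdio entries, since the former comes from the ANSI codepage and the latter 
from the console codepages.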

>    >>> locale.getlocale()
>    (None, None)
>    >>> locale.getlocale(locale.LC_ALL)
>    (None, None)

On most POSIX platforms nowadays, Py_Initialize sets the LC_CTYPE category to 
its default value by calling setlocale(LC_CTYPE, "") in order to "obtain the 
locale's charset without having to switch locales". On the other hand, the 
bootstrapping process for Windows doesn't use the C runtime locale, so at 
startup LC_CTYPE is still in the default "C" locale:

    >>> locale.setlocale(locale.LC_CTYPE, None)
    'C'

This in turn gets parsed into the (None, None) tuple that getlocale() returns:

    >>> locale._parse_localename('C')
    (None, None)
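
To check what the startup LC_CTYPE is on a given build, and what it would 
become after the setlocale(LC_CTYPE, "") call described above, here is a rough 
sketch. The function name is mine. It restores the original setting before 
returning, and falls back to the startup value if the environment names a 
locale the C runtime rejects.

```python
import locale


def ctype_before_and_after_default():
    """Return (LC_CTYPE at call time, LC_CTYPE after setlocale(LC_CTYPE, ""))."""
    # Passing None only queries the current setting.
    before = locale.setlocale(locale.LC_CTYPE, None)
    try:
        # The empty string selects the user's default locale.
        after = locale.setlocale(locale.LC_CTYPE, '')
    except locale.Error:
        # The environment names a locale the C runtime does not know.
        after = before
    finally:
        # Restore the original setting so the caller's locale is unchanged.
        locale.setlocale(locale.LC_CTYPE, before)
    return before, after


if __name__ == '__main__':
    print(ctype_before_and_after_default())
```

On the Windows build discussed here the first element should be 'C'. On a 
POSIX build where Py_Initialize has already made the call, I'd expect both 
elements to be the same.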

----------
nosy: +eryksun

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue21808>
_______________________________________