[issue35883] Change invalid unicode characters to replacement characters in argv

Eryk Sun Fri, 01 Feb 2019 15:50:18 -0800


Eryk Sun <[email protected]> added the comment:


On Unix, Python 3.6 decodes the char * command-line arguments via mbstowcs. On 
Linux, I see the following misbehavior of mbstowcs when decoding an overlong 
UTF-8 sequence:

    >>> import ctypes
    >>> mbstowcs = ctypes.CDLL(None, use_errno=True).mbstowcs
    >>> arg = bytes(x + 128 for x in [1 + 124, 63, 63, 59, 58, 58])
    >>> mbstowcs(None, arg, 0)
    1
    >>> buf = (ctypes.c_int * 2)()
    >>> mbstowcs(buf, arg, 2)
    1
    >>> hex(buf[0])
    '0x7fffbeba'

This shouldn't be an issue in 3.7, at least not with the default UTF-8 mode 
configuration. With this mode, Py_DecodeLocale calls _Py_DecodeUTF8Ex using the 
surrogateescape error handler [1].

[1]: https://github.com/python/cpython/blob/v3.7.2/Python/fileutils.c#L456
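
For comparison, here is a rough Python-level illustration of what the 
surrogateescape error handler does with the same bytes. It uses the built-in 
"utf-8" codec rather than the C function _Py_DecodeUTF8Ex, so treat it as an 
approximation of the behavior, not a trace of the startup code.

```python
arg = bytes([0xFD, 0xBF, 0xBF, 0xBB, 0xBA, 0xBA])

# The strict decoder rejects the sequence outright.
try:
    arg.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)

# surrogateescape maps each undecodable byte 0xXX to the lone surrogate
# U+DC00 + 0xXX, so no information is lost.
text = arg.decode("utf-8", errors="surrogateescape")
print([hex(ord(c)) for c in text])

# Encoding with the same handler restores the original bytes.
roundtrip = text.encode("utf-8", errors="surrogateescape")
print(roundtrip == arg)
```

Each invalid byte becomes its own lone surrogate, so the argument can be 
passed back to the OS unchanged.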

----------
nosy: +eryksun

_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue35883>
_______________________________________
