Johannes Berg <johan...@sipsolutions.net> added the comment:
Like I said above, it could be argued that the bug is in glibc, and then
https://p.sipsolutions.net/6a4e9fce82dbbfa0.txt could be used as a simple
LD_PRELOAD wrapper to work around this, just to illustrate the problem from
that side.

Arguably, that puts glibc in violation of RFC 3629, since it says:

  3. UTF-8 definition

  [...]
  In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
  accessible range) are encoded using sequences of 1 to 4 octets.
  [...]

      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

  [...]
  Implementations of the decoding algorithm above MUST protect against
  decoding invalid sequences.
  [...]

Here's a simple test program:
https://p.sipsolutions.net/ac091b4ea4b7f742.txt

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue35883>
_______________________________________
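
Editor's illustration, not part of the original comment: the RFC 3629 rules quoted above can be sketched as a small strict decoder. This is a minimal sketch only. The function name `decode_utf8_char` is hypothetical, and it is neither the linked test program nor glibc's or CPython's implementation. It decodes a single code point and rejects the cases the RFC calls invalid: bad lead or continuation bytes, overlong forms, surrogates, and values beyond U+10FFFF.

```python
from typing import Tuple


def decode_utf8_char(data: bytes) -> Tuple[int, int]:
    """Decode one code point from the start of data.

    Returns (code_point, number_of_bytes_consumed).
    Raises ValueError for any sequence RFC 3629 does not permit.
    Hypothetical helper written for illustration only.
    """
    if not data:
        raise ValueError("empty input")

    lead = data[0]
    if lead < 0x80:                      # 0xxxxxxx
        return lead, 1
    if 0xC0 <= lead < 0xE0:              # 110xxxxx
        length, value, minimum = 2, lead & 0x1F, 0x80
    elif 0xE0 <= lead < 0xF0:            # 1110xxxx
        length, value, minimum = 3, lead & 0x0F, 0x800
    elif 0xF0 <= lead < 0xF8:            # 11110xxx
        length, value, minimum = 4, lead & 0x07, 0x10000
    else:
        # A stray continuation byte (10xxxxxx) or a lead byte 0xF8..0xFF,
        # which would start a sequence longer than 4 octets.
        raise ValueError("invalid lead byte 0x%02x" % lead)

    if len(data) < length:
        raise ValueError("truncated sequence")

    for byte in data[1:length]:
        if byte & 0xC0 != 0x80:          # must be 10xxxxxx
            raise ValueError("invalid continuation byte 0x%02x" % byte)
        value = (value << 6) | (byte & 0x3F)

    if value < minimum:
        raise ValueError("overlong encoding")
    if 0xD800 <= value <= 0xDFFF:
        raise ValueError("surrogate code point")
    if value > 0x10FFFF:
        raise ValueError("code point beyond U+10FFFF")
    return value, length
```

For example, `decode_utf8_char(b"\xf4\x8f\xbf\xbf")` yields U+10FFFF, the last valid code point, while `b"\xf4\x90\x80\x80"` (which would encode 0x110000) raises `ValueError`. Whether a given C library rejects the same inputs is exactly what the comment above is questioning; this sketch makes no claim about glibc's actual behaviour.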