New submission from Chris Angelico:

>>> b"\xed\xb4\x80".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
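
To see what the rejected bytes would encode, the three-byte pattern can be unpacked by hand. This is only a rough sketch, not how CPython's decoder works; `decode_three_byte` and `is_surrogate` are hypothetical helper names made up for illustration. It deliberately skips the validity checks a real decoder performs, and compares the failing sequence with a valid one that shares the same lead byte.

```python
def decode_three_byte(seq: bytes) -> int:
    """Unpack the bit pattern 1110xxxx 10xxxxxx 10xxxxxx into a code point.

    Hypothetical helper: it checks only the bit pattern and does NOT
    reject surrogates or overlong forms the way a real decoder must.
    """
    b0, b1, b2 = seq
    if b0 & 0xF0 != 0xE0 or b1 & 0xC0 != 0x80 or b2 & 0xC0 != 0x80:
        raise ValueError("not a three-byte UTF-8 pattern")
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_surrogate(cp: int) -> bool:
    """True for code points in the UTF-16 surrogate range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


bad = b"\xed\xb4\x80"    # the sequence from the traceback
good = b"\xed\x9f\xbf"   # same lead byte 0xED, but a valid sequence

cp_bad = decode_three_byte(bad)
cp_good = decode_three_byte(good)

print(f"U+{cp_bad:04X} surrogate={is_surrogate(cp_bad)}")    # U+DD00 surrogate=True
print(f"U+{cp_good:04X} surrogate={is_surrogate(cp_good)}")  # U+D7FF surrogate=False

# The stdlib "surrogatepass" error handler lets the codec through and
# agrees with the hand-decoded value.
assert bad.decode("utf-8", "surrogatepass") == chr(cp_bad)
```

Both sequences have well-formed lead and continuation bit patterns; the only difference is that the first lands in the surrogate range, which the strict UTF-8 codec refuses.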

The actual problem here is that this byte sequence would decode to U+DD00, which, being a surrogate, is invalid for the encoding. It's correct to raise UnicodeDecodeError, but the text of the message is a bit obscure. I'm not sure whether the "invalid continuation byte" is talking about the "0xed in position 0" or about one of the others; 0xED is not a continuation byte, it's a start byte - and a perfectly valid one:

>>> b"\xed\x9f\xbf".decode("utf-8")
'\ud7ff'

Pike is more explicit about what the problem is:

> utf8_to_string("\xed\xb4\x80");
UTF-8 sequence beginning with 0xed 0xb4 at index 0 would decode to a UTF-16 surrogate character.

Is this something worth fixing?

Tested on 3.4.2 and a recent build of 3.5, probably applies to most 3.x versions. (2.7 actually permits this, which is a bigger bug, but one with backward-compatibility issues.)

----------
components: Interpreter Core, Unicode
messages: 237572
nosy: Rosuav, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: Opaque error message on UTF-8 decoding to surrogates
versions: Python 3.4, Python 3.5

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue23614>
_______________________________________