New submission from Chris Angelico:

>>> b"\xed\xb4\x80".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

The actual problem here is that this byte sequence would decode to U+DD00, 
which, being a surrogate, is invalid for the encoding. It's correct to raise 
UnicodeDecodeError, but the text of the message is a bit obscure. I'm not sure 
whether the "invalid continuation byte" is talking about the "0xed in position 
0" or about one of the others; 0xED is not a continuation byte, it's a start 
byte - and a perfectly valid one:

>>> b"\xed\x9f\xbf".decode("utf-8")
'\ud7ff'
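
To make the arithmetic concrete, here is a minimal sketch that unpacks the two three-byte sequences above by hand and checks the result against the surrogate range U+D800..U+DFFF. The helper names decode_three_byte and is_surrogate are made up for illustration and are not part of any codec API; the sketch performs no validation of its own.

```python
def decode_three_byte(seq):
    """Unpack a 3-byte UTF-8 sequence by hand, with no validity checks."""
    b0, b1, b2 = seq
    # Layout: 1110xxxx 10yyyyyy 10zzzzzz -> xxxxyyyyyyzzzzzz
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_surrogate(code_point):
    """True if the code point lies in the UTF-16 surrogate range."""
    return 0xD800 <= code_point <= 0xDFFF


bad = decode_three_byte(b"\xed\xb4\x80")
good = decode_three_byte(b"\xed\x9f\xbf")

print(hex(bad), is_surrogate(bad))    # 0xdd00 True
print(hex(good), is_surrogate(good))  # 0xd7ff False
```

Both sequences start with 0xED; only the second byte (0xB4 versus 0x9F) decides whether the result lands in the surrogate range.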

Pike is more explicit about what the problem is:

> utf8_to_string("\xed\xb4\x80");
UTF-8 sequence beginning with 0xed 0xb4 at index 0 would decode to a
UTF-16 surrogate character.
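
As a rough illustration of what a more explicit message could look like from Python code, the sketch below wraps bytes.decode and rewords the error when the failing sequence starts with 0xED followed by a byte in 0xA0..0xBF. The function decode_utf8_explicit and its message wording are hypothetical. This is not how the built-in codec works and not a proposed patch.

```python
def decode_utf8_explicit(data):
    """Hypothetical helper: decode UTF-8, rewording the error for surrogates."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        seq = data[exc.start:exc.start + 3]
        # A lead byte of 0xED followed by 0xA0..0xBF encodes U+D800..U+DFFF.
        if len(seq) >= 2 and seq[0] == 0xED and 0xA0 <= seq[1] <= 0xBF:
            reason = ("sequence beginning with 0x%02x 0x%02x would decode "
                      "to a surrogate" % (seq[0], seq[1]))
            raise UnicodeDecodeError(
                "utf-8", data, exc.start, exc.start + len(seq), reason
            ) from None
        # Any other decoding failure is passed through unchanged.
        raise


try:
    decode_utf8_explicit(b"\xed\xb4\x80")
except UnicodeDecodeError as exc:
    print(exc.reason)
```

The original exception's start attribute is reused, so the reported position still points at the 0xED byte.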

Is this something worth fixing?

Tested on 3.4.2 and a recent build of 3.5; this probably applies to most 3.x 
versions. (2.7 actually permits this, which is a bigger bug, but one with 
backward-compatibility issues.)
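
For comparison, Python 3 code that really does need the lenient behaviour can opt in with the "surrogatepass" error handler. This short sketch only shows that the strict and lenient paths differ; it is not a suggestion that the default should change.

```python
data = b"\xed\xb4\x80"

# Strict decoding (the default) rejects the sequence.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# "surrogatepass" lets the lone surrogate through in both directions.
lenient = data.decode("utf-8", "surrogatepass")
round_trip = lenient.encode("utf-8", "surrogatepass")

print(strict_ok)          # False
print(ascii(lenient))     # '\udd00'
print(round_trip == data) # True
```

ascii() is used for printing because a lone surrogate cannot normally be encoded for output to the terminal.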

----------
components: Interpreter Core, Unicode
messages: 237572
nosy: Rosuav, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: Opaque error message on UTF-8 decoding to surrogates
versions: Python 3.4, Python 3.5

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue23614>
_______________________________________
_______________________________________________
Python-bugs-list mailing list