Ezio Melotti added the comment:
Table 3-7 of http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (page 93
of the book, or 40 of the pdf) shows that if the start byte is ED, the first
continuation byte must be in the range 80..9F. To decode a sequence starting
with ED, you need two more valid continuation bytes. Since the following byte
(B4) is not in the allowed range 80..9F and is thus an invalid continuation
byte, the decoder doesn't know how to decode the byte in position 0 (i.e. ED).
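To make the rule concrete, here is a minimal sketch of the second-byte check
implied by Table 3-7. The names SECOND_BYTE_RANGES and second_byte_ok are made
up for illustration and are not part of CPython's decoder; the ranges are
transcribed from the table.

```python
# Allowed range of the second byte for each lead byte, transcribed from
# Table 3-7 (well-formed UTF-8 byte sequences). Hypothetical helper names.
SECOND_BYTE_RANGES = [
    ((0xC2, 0xDF), (0x80, 0xBF)),
    ((0xE0, 0xE0), (0xA0, 0xBF)),
    ((0xE1, 0xEC), (0x80, 0xBF)),
    ((0xED, 0xED), (0x80, 0x9F)),  # ED: narrower range keeps out U+D800..U+DFFF
    ((0xEE, 0xEF), (0x80, 0xBF)),
    ((0xF0, 0xF0), (0x90, 0xBF)),
    ((0xF1, 0xF3), (0x80, 0xBF)),
    ((0xF4, 0xF4), (0x80, 0x8F)),
]


def second_byte_ok(lead, second):
    """Return True if `second` may follow the lead byte `lead`."""
    for (lead_lo, lead_hi), (sec_lo, sec_hi) in SECOND_BYTE_RANGES:
        if lead_lo <= lead <= lead_hi:
            return sec_lo <= second <= sec_hi
    # ASCII or an invalid lead byte: no continuation byte is expected.
    return False
```

With this sketch, second_byte_ok(0xED, 0xB4) is False, which matches the
decoder rejecting the byte in position 0.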
It is also true that this particular sequence, if allowed, would result in a
surrogate. However, by looking at the first two bytes only, you don't have
enough information to be sure about that (e.g. ED B4 00 doesn't decode to a
surrogate, so Pike's error message is imprecise).
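The point about needing the third byte can be illustrated with a small sketch
that applies the 3-byte UTF-8 bit layout by hand. The function name
would_be_surrogate is hypothetical and only serves the illustration.

```python
def would_be_surrogate(data):
    """Return True if the first three bytes follow the 3-byte UTF-8 bit
    layout (1110xxxx 10xxxxxx 10xxxxxx) and the resulting value falls in
    the surrogate range U+D800..U+DFFF."""
    if len(data) < 3:
        return False
    b0, b1, b2 = data[0], data[1], data[2]
    if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        return False
    code_point = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
    return 0xD800 <= code_point <= 0xDFFF
```

Here would_be_surrogate(b"\xed\xb4\x80") is True (the bits give U+DD00), while
would_be_surrogate(b"\xed\xb4\x00") is False because 00 is not a continuation
byte at all.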
If handling this special case doesn't require too much extra code, it would be
ok with me to have something like:
>>> b"\xed\xb4\x80".decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid
continuation byte (possible start of a surrogate)
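As a rough idea of how small the special case could be, here is a sketch done
at the Python level rather than inside the C decoder. decode_with_hint and HINT
are hypothetical names, and the check (ED followed by a byte in A0..BF) is only
an assumption about what "possible start of a surrogate" would mean.

```python
HINT = " (possible start of a surrogate)"


def decode_with_hint(data):
    """Decode UTF-8 bytes, extending the error reason when the failing
    position looks like the first two bytes of an encoded surrogate."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        i = exc.start
        nxt = data[i + 1] if i + 1 < len(data) else None
        # ED followed by A0..BF is the prefix of an encoded U+D800..U+DFFF.
        if data[i] == 0xED and nxt is not None and 0xA0 <= nxt <= 0xBF:
            raise UnicodeDecodeError(
                exc.encoding, exc.object, exc.start, exc.end,
                exc.reason + HINT,
            ) from None
        raise
```

Calling decode_with_hint(b"\xed\xb4\x80") raises a UnicodeDecodeError whose
reason ends with the hint, while other invalid input keeps the usual message.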
----------
type: -> enhancement
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue23614>
_______________________________________