[issue23614] Opaque error message on UTF-8 decoding to surrogates

Ezio Melotti Fri, 13 Mar 2015 10:59:16 -0700

Ezio Melotti added the comment:

Table 3-7 of http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (page 93 
of the book, or 40 of the pdf) shows that if the start byte is ED, the first 
continuation byte must be in the range 80..9F.  This means that, in order to 
decode a sequence starting with ED, you need two more valid continuation 
bytes.  Since the following byte (B4) is outside the allowed range 80..9F, it 
is an invalid continuation byte, and the decoder can't decode the byte in 
position 0 (i.e. ED).
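
To make the table lookup concrete, here is a minimal sketch of the second-byte 
ranges from Table 3-7.  The function name second_byte_range is made up for 
illustration; this is not how the CPython decoder is written.

```python
def second_byte_range(lead):
    """Return (low, high) allowed for the byte following `lead`, per
    Table 3-7, or None if `lead` does not start a multi-byte sequence.

    Hypothetical helper for illustration only.
    """
    if 0xC2 <= lead <= 0xDF:
        return (0x80, 0xBF)
    if lead == 0xE0:
        return (0xA0, 0xBF)   # excludes overlong encodings
    if lead == 0xED:
        return (0x80, 0x9F)   # excludes surrogates D800..DFFF
    if 0xE1 <= lead <= 0xEF:  # E1..EC and EE..EF
        return (0x80, 0xBF)
    if lead == 0xF0:
        return (0x90, 0xBF)   # excludes overlong encodings
    if 0xF1 <= lead <= 0xF3:
        return (0x80, 0xBF)
    if lead == 0xF4:
        return (0x80, 0x8F)   # excludes code points above 10FFFF
    return None


low, high = second_byte_range(0xED)
# B4 falls outside the range allowed after ED, so it is rejected
# before the third byte is even considered.
b4_allowed = low <= 0xB4 <= high
```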


It is also true that this particular sequence, if allowed, would result in a 
surrogate.  However, by looking at the first two bytes only, you don't have 
enough information to be sure about that (e.g. ED B4 00 doesn't decode to a 
surrogate, so Pike's error message is imprecise).
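
A rough sketch of that point, using made-up helper names: combining the 
payload bits of ED B4 80 without any validity checks lands in the surrogate 
range, while in ED B4 00 the third byte is not a continuation byte at all, so 
no three-byte value can be formed.

```python
def naive_three_byte_value(b0, b1, b2):
    """Combine the payload bits of a 3-byte sequence with no validation.

    Hypothetical helper; a real decoder checks the bytes first.
    """
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_continuation(byte):
    """True if `byte` has the 10xxxxxx bit pattern."""
    return (byte & 0xC0) == 0x80


def is_surrogate(code_point):
    return 0xD800 <= code_point <= 0xDFFF


# ED B4 80: all three bytes have the right shape, and the result is a
# surrogate, which is why the strict decoder must reject it.
value = naive_three_byte_value(0xED, 0xB4, 0x80)
would_be_surrogate = is_surrogate(value)

# ED B4 00: the third byte is not a continuation byte, so the sequence
# is invalid for a reason unrelated to surrogates.
third_byte_ok = is_continuation(0x00)

# The "surrogatepass" error handler lets the first sequence through.
passed = b"\xed\xb4\x80".decode("utf-8", "surrogatepass")
```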

If handling this special case doesn't require too much extra code, it would be 
ok with me to have something like:
>>> b"\xed\xb4\x80".decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid 
continuation byte (possible start of a surrogate)
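
As a rough illustration of the extra check this would need, the sketch below 
adds the hint from outside the codec.  The helper describe_utf8_error and its 
logic are assumptions made for this example, not a proposed patch; it only 
uses the start attribute that UnicodeDecodeError already provides.

```python
def describe_utf8_error(data):
    """Return an error description for `data`, or None if it decodes.

    Hypothetical helper: appends a hint when the failing byte is ED and
    the next byte is in A0..BF, i.e. the shape of an encoded surrogate.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        message = str(exc)
        following = data[exc.start + 1:exc.start + 2]
        if (data[exc.start] == 0xED and following
                and 0xA0 <= following[0] <= 0xBF):
            message += " (possible start of a surrogate)"
        return message
    return None


hinted = describe_utf8_error(b"\xed\xb4\x80")
plain = describe_utf8_error(b"\xff")
clean = describe_utf8_error(b"abc")
```

The hint is worded as "possible" because only two bytes are inspected, which 
matches the caveat above about ED B4 00.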

----------
type:  -> enhancement

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue23614>
_______________________________________
_______________________________________________
Python-bugs-list mailing list