[issue23614] Opaque error message on UTF-8 decoding to surrogates

Ezio Melotti Fri, 13 Mar 2015 15:26:06 -0700

Ezio Melotti added the comment:

> Nice document. Is that actually how Python's decoder checks things?


Yes, Python follows the Unicode standard.

> * E0 followed by 80..9F: "non-shortest form"
> * ED followed by A0..BF: "surrogate"
> * F4 followed by 90..BF: "outside defined range"

If you get a decode error while using UTF-8, it means that you are trying to 
decode something that is not (valid) UTF-8.  I can see 3 situations where this 
might happen:
1) the input is using a different encoding;
2) the input is corrupted;
3) the input is using an encoding similar to UTF-8 (e.g. CESU-8).

In the first two cases additional information about continuation bytes is 
meaningless and misleading (there's no such thing as shortest form or 
surrogates in e.g. ASCII).  In the third case (which is actually a special 
case of 1), mentioning surrogates and perhaps non-shortest form might be 
useful if the developer is intimately familiar with UTF-8 and Unicode, since 
they might suspect that the input is actually CESU-8 or that the text has 
been encoded by an outdated encoder that follows the RFC 2044 specs from 1996.
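
To make the three quoted ranges and the CESU-8 case concrete, here is a small 
sketch.  The sample byte strings and the helper name decode_failure are 
illustrative choices, not taken from the tracker; the CESU-8 sample assumes 
U+1F600 encoded as the surrogate pair D83D DE00.

```python
# One sample per restricted range: a lead byte followed by a second byte
# that a strict UTF-8 decoder must reject.
SAMPLES = {
    "non-shortest form": b"\xe0\x80\x80",
    "surrogate": b"\xed\xa0\x80",
    "outside defined range": b"\xf4\x90\x80\x80",
}


def decode_failure(data):
    """Return the UnicodeDecodeError raised by strict UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc
    return None


for label, data in SAMPLES.items():
    exc = decode_failure(data)
    print(label, "->", exc.reason if exc else "decoded without error")

# CESU-8 stores a non-BMP character as two 3-byte encoded surrogates.
cesu8 = b"\xed\xa0\xbd\xed\xb8\x80"  # assumed sample: U+1F600 as D83D DE00

# Strict UTF-8 rejects it; "surrogatepass" lets the lone surrogates through.
pair = cesu8.decode("utf-8", "surrogatepass")

# Round-tripping through UTF-16 joins the pair into one code point.
text = pair.encode("utf-16", "surrogatepass").decode("utf-16")
print(ascii(pair), "->", ascii(text))
```

The surrogatepass round trip is only one way to recover such input, and it is 
appropriate only once you already suspect CESU-8.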

> How does this look?
>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0:
> invalid continuation byte 0xb4 for this start byte

Something similar would be ok with me, assuming it is easy to implement in 
the code.
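
As a rough illustration of the proposed wording, the message can be 
approximated today from the attributes of the exception.  This is a sketch 
only: describe_decode_error is a hypothetical helper, not part of Python, and 
it assumes that for "invalid continuation byte" errors the offending byte sits 
at index exc.end, which is what CPython appears to report but is not a 
documented guarantee.

```python
def describe_decode_error(exc):
    """Build a message in the proposed style from a UnicodeDecodeError.

    Hypothetical helper; falls back to the stock message whenever the
    assumptions above do not hold.
    """
    data = exc.object
    if exc.reason != "invalid continuation byte" or exc.end >= len(data):
        return str(exc)
    start_byte = data[exc.start]
    bad_byte = data[exc.end]  # assumption: first byte that failed the check
    return (
        f"{exc.encoding!r} codec can't decode byte 0x{start_byte:02x} "
        f"in position {exc.start}: invalid continuation byte "
        f"0x{bad_byte:02x} for this start byte"
    )


def try_decode(data):
    """Decode as strict UTF-8, returning (text, None) or (None, message)."""
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        return None, describe_decode_error(exc)


text, message = try_decode(b"\xed\xb4\x80")
print(message)
```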

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue23614>
_______________________________________