[issue23614] Opaque error message on UTF-8 decoding to surrogates

Chris Angelico Fri, 13 Mar 2015 14:54:36 -0700

Chris Angelico added the comment:

Nice document. Is that actually how Python's decoder checks things? Does the 
decoder have different definitions of "valid continuation byte" based on the 
lead byte? If that's the case... well, ten out of ten for complying with the 
spec, to be sure, but unfortunately it leads to some opaque error messages!


I haven't looked into the code even a little bit, but would it be possible to 
have a specific error message attached to certain "invalid continuation bytes"?

* E0 followed by 80..9F: "non-shortest form"
* ED followed by A0..BF: "surrogate"
* F4 followed by 90..BF: "outside defined range"
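
A rough sketch of what that lookup could look like, using only the three cases listed above. The function name explain_second_byte is made up for illustration, and this is not a claim about how CPython's decoder is actually structured:

```python
from typing import Optional


def explain_second_byte(lead: int, second: int) -> Optional[str]:
    """Return a specific reason if `second` is rejected after `lead`.

    Hypothetical helper: it covers only the three lead-byte cases listed
    in this comment and returns None for everything else.
    """
    if lead == 0xE0 and 0x80 <= second <= 0x9F:
        return "non-shortest form"
    if lead == 0xED and 0xA0 <= second <= 0xBF:
        return "surrogate"
    if lead == 0xF4 and 0x90 <= second <= 0xBF:
        return "outside defined range"
    return None


if __name__ == "__main__":
    # The ED B4 pair from this issue falls into the "surrogate" case.
    print(explain_second_byte(0xED, 0xB4))
```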

If that's too hard, it'd at least be helpful to point out that the "invalid 
continuation byte" is not the same as the "byte 0x?? in position ?" - the 
rejection here is actually of the B4 that follows it. How does this look?

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid 
continuation byte 0xb4 for this start byte
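
To make the proposal concrete, here is a minimal sketch that builds this wording from the outside, by catching the existing exception. The helper name describe_decode_error is hypothetical, and it assumes that for an "invalid continuation byte" error the exception's end attribute indexes the byte that was rejected, which is how CPython appears to behave:

```python
def describe_decode_error(data: bytes) -> str:
    """Decode `data` as UTF-8 and return an extended error message.

    Hypothetical helper, not part of any library. Returns "ok" when the
    data decodes cleanly.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        message = str(exc)
        # Assumption: exc.end is the index of the rejected continuation byte.
        if exc.reason == "invalid continuation byte" and exc.end < len(exc.object):
            rejected = exc.object[exc.end]
            message += f" 0x{rejected:02x} for this start byte"
        return message
    return "ok"


if __name__ == "__main__":
    # ED B4 80 would encode a surrogate, so the decoder rejects the B4.
    print(describe_decode_error(b"\xed\xb4\x80"))
```

A real fix would of course belong in the decoder itself rather than in a wrapper like this.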

(BTW, I think Pike's decoder just always emits two bytes, no matter what the 
actual errant stream is. After all, when there's an error in the data, there's 
no way to know how many bytes "ought to have been" one character. So it's 
incomplete, yes, but when you're dealing with wrong data, completeness isn't 
really possible anyway.)

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue23614>
_______________________________________