[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

John Machin Thu, 01 Apr 2010 00:29:56 -0700

John Machin <[email protected]> added the comment:

#ezio.melotti: """I'm considering valid all the bytes that start with '10...'"""


Sorry, WRONG. Read what I wrote: """Further, some bytes in the range 80-BF are 
NOT always valid as the first continuation byte, it depends on what starter 
byte they follow."""

Consider these sequences: (1) E0 80 80 (2) E0 9F 80. Both are invalid 
(over-long) sequences: after an E0 starter byte, the first continuation byte 
must be in A0-BF, not 80-9F. The bytes 80 and 9F start with '10...' but they 
are invalid after an E0 starter byte.
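
To make the over-long point concrete, here is a small sketch that combines the payload bits of a 3-byte sequence with no range checks at all. The helper name naive_decode_3byte is made up for illustration and is not CPython's decoder. Any result below U+0800 fits in one or two bytes, so encoding it in three bytes is over-long.

```python
def naive_decode_3byte(b0, b1, b2):
    """Combine the payload bits of a 3-byte UTF-8 sequence, with no validity checks.

    Hypothetical helper for illustration only; a real decoder must reject
    over-long forms instead of returning a code point for them.
    """
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


# Smallest code point that genuinely needs three bytes.
MIN_3BYTE_CODE_POINT = 0x0800

for seq in [(0xE0, 0x80, 0x80), (0xE0, 0x9F, 0x80), (0xE0, 0xA0, 0x80)]:
    cp = naive_decode_3byte(*seq)
    status = "over-long" if cp < MIN_3BYTE_CODE_POINT else "ok"
    print(" ".join("%02X" % b for b in seq), "-> U+%04X" % cp, status)
```

Only the third sequence, whose first continuation byte is A0, reaches U+0800; the two sequences discussed above fall short of it.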

Please read "Table 3-7. Well-Formed UTF-8 Byte Sequences" and surrounding text 
in Unicode 5.2.0 chapter 3 (bearing in mind that CPython (for good reasons) 
doesn't implement the surrogates restriction, so that the special case for 
starter byte ED is not used in CPython). Note the other 3 special cases for the 
first continuation byte.
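
As a rough sketch of what Table 3-7 says about the first continuation byte, the check can be written as a lookup on the starter byte. The function names and the allow_surrogates flag are assumptions made for this example; the flag only models the behaviour described above, where the ED restriction is not applied, and this is not CPython's actual code.

```python
def first_continuation_range(starter, allow_surrogates=False):
    """Return (low, high) for the first continuation byte after `starter`,
    following Table 3-7 of Unicode 5.2.0, or None if `starter` cannot
    begin a multi-byte sequence.

    allow_surrogates is a hypothetical switch: when True, the special
    case for starter byte ED is dropped.
    """
    if 0xC2 <= starter <= 0xDF:
        return (0x80, 0xBF)
    if starter == 0xE0:
        return (0xA0, 0xBF)  # 80-9F would be over-long
    if starter == 0xED and not allow_surrogates:
        return (0x80, 0x9F)  # A0-BF would encode a surrogate
    if 0xE1 <= starter <= 0xEF:
        return (0x80, 0xBF)
    if starter == 0xF0:
        return (0x90, 0xBF)  # 80-8F would be over-long
    if 0xF1 <= starter <= 0xF3:
        return (0x80, 0xBF)
    if starter == 0xF4:
        return (0x80, 0x8F)  # 90-BF would exceed U+10FFFF
    return None              # ASCII, continuation byte, C0, C1, F5-FF


def is_valid_first_continuation(starter, byte, allow_surrogates=False):
    """True if `byte` may directly follow `starter` in well-formed UTF-8."""
    rng = first_continuation_range(starter, allow_surrogates)
    return rng is not None and rng[0] <= byte <= rng[1]
```

For example, is_valid_first_continuation(0xE0, 0x9F) is False even though 9F starts with '10...', while is_valid_first_continuation(0xE1, 0x9F) is True.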

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue8271>
_______________________________________