On Mon, Mar 9, 2015 at 5:25 AM, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> Marko Rauhamaa wrote:
>
>> Chris Angelico <ros...@gmail.com>:
>>
>>> Once again, you appear to be surprised that invalid data is failing.
>>> Why is this so strange? U+DD00 is not a valid character.
>
> But it is a valid non-character code point.
>
>>> It is quite correct to throw this error.
>>
>> '\udd00' is a valid str object:
>
> Is it though? Perhaps the bug is not UTF-8's inability to encode lone
> surrogates, but that Python allows you to create lone surrogates in the
> first place. That's not a rhetorical question. It's a genuine question.

Ah, I see the confusion. Yes, it is plausible to permit the UTF-8-like
encoding of surrogates; but it's illegal according to the RFC:

https://tools.ietf.org/html/rfc3629
"""
The definition of UTF-8 prohibits encoding character numbers between
U+D800 and U+DFFF, which are reserved for use with the UTF-16
encoding form (as surrogate pairs) and do not directly represent
characters.
"""

They're not valid characters, and the UTF-8 spec explicitly says that
they must not be encoded. Python is fully spec-compliant in rejecting
these. Some encoders [1] will permit them, but the resulting stream is
invalid UTF-8, just as CESU-8 and Modified UTF-8 are (the latter being
"UTF-8, only U+0000 is represented as C0 80").

ChrisA

[1] eg http://pike.lysator.liu.se/generated/manual/modref/ex/predef_3A_3A/string_to_utf8.html
optionally
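
For anyone who wants to see the behaviour described above, here is a minimal sketch for Python 3. The variable names are only for illustration; 'surrogatepass' is the standard codec error handler that opts in to the UTF-8-like encoding of surrogates, and its output is not valid UTF-8.

```python
# A lone surrogate is accepted as a str, but strict UTF-8 refuses to encode it.
lone = '\udd00'

try:
    lone.encode('utf-8')
except UnicodeEncodeError:
    strict_failed = True
else:
    strict_failed = False

# Opting in with 'surrogatepass' produces the three-byte UTF-8-like form.
relaxed = lone.encode('utf-8', 'surrogatepass')

# A strict decoder rejects those bytes, since they are not valid UTF-8.
try:
    relaxed.decode('utf-8')
except UnicodeDecodeError:
    strict_decode_failed = True
else:
    strict_decode_failed = False

# The same handler is needed to get the original string back.
round_trip = relaxed.decode('utf-8', 'surrogatepass')

print(strict_failed, relaxed, strict_decode_failed, round_trip == lone)
```

So the lone surrogate only survives a round trip if both ends agree to use the non-standard handler, which is the situation the RFC is trying to rule out.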