On Thu, Aug 25, 2011 at 7:28 PM, Isaac Morland <ijmor...@uwaterloo.ca> wrote:
> On Thu, 25 Aug 2011, Guido van Rossum wrote:
>
>> I'm not sure what should happen with UTF-8 when it (in flagrant
>> violation of the standard, I presume) contains two separately-encoded
>> surrogates forming a valid surrogate pair; probably whatever the UTF-8
>> codec does on a wide build today should be good enough. Similarly for
>> encoding to UTF-8 on a wide build if one managed to create a string
>> containing a surrogate pair. Basically, I'm for a
>> garbage-in-garbage-out approach (with separate library functions to
>> detect garbage if the app is worried about it).
>
> If it's called UTF-8, there is no decision to be taken as to decoder
> behaviour - any byte sequence not permitted by the Unicode standard must
> result in an error (although, of course, *how* the error is to be reported
> could legitimately be the subject of endless discussion). There are
> security implications to violating the standard so this isn't just
> legalistic purity.

You have a point. The security issues cannot be seen separately from all
the other issues. The folks inside Google who care about Unicode often
harp on this. So I stand corrected. I am fine with codecs treating code
points or code point sequences that the Unicode standard doesn't like
(e.g. lone surrogates) the same way as more severe errors in the encoded
bytes (lots of byte sequences already aren't valid UTF-8). I just hope
this doesn't require normal forms or other expensive operations; I hope
it's limited to rejecting invalid use of surrogates or other values that
are not valid code points (e.g. 0, or >= 2**21).

> Hmmm, doesn't look good:
>
> Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> '\xed\xb0\x80'.decode ('utf-8')
> u'\udc00'
> >>>
>
> Incorrect! Although this is a narrow build - I can't say what the wide
> build would do.
>
> For reasons of practicality, it may be appropriate to provide easy access to
> a CESU-8 decoder in addition to the normal UTF-8 decoder, but it must not be
> called UTF-8. Other variations may also find use if provided.
>
> See UTF-8 RFC: http://www.ietf.org/rfc/rfc3629.txt
>
> And CESU-8 technical report: http://www.unicode.org/reports/tr26/

Thanks for the links! I also like the term "supplemental character" (a
code point >= 2**16). And I note that they talk about characters where
we've just agreed that we should say code points...

--
--Guido van Rossum (python.org/~guido)
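
For illustration, here is a minimal sketch of the behaviour discussed in this thread: a strict UTF-8 decoder that rejects the three bytes from the quoted session, plus a separate check for surrogate code points. It assumes Python 3 (the quoted session ran on Python 2.6, where the result differed). The helper names decode_strict and has_surrogates are hypothetical, chosen only for this example; the "surrogatepass" error handler is the standard library's opt-in way to let surrogates through.

```python
# Assumes Python 3. The helper names below are hypothetical, for illustration only.

# The bytes from the quoted session: a UTF-8-style encoding of lone surrogate U+DC00.
data = b"\xed\xb0\x80"


def decode_strict(raw):
    """Return (text, None) on success, or (None, reason) if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        return None, exc.reason


def has_surrogates(text):
    """Report whether text contains any code point in the range U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


# The default (strict) decoder refuses the encoded surrogate.
text, reason = decode_strict(data)
print(text, reason)

# Opting in to the non-standard behaviour must be explicit.
lenient = data.decode("utf-8", "surrogatepass")
print(has_surrogates(lenient))
```

Keeping the lenient path behind an explicitly named error handler matches the point made above: anything that accepts encoded surrogates should not be reachable under the plain "utf-8" name.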