On Fri, Aug 26, 2011 at 5:59 AM, Guido van Rossum <gu...@python.org> wrote:
> On Thu, Aug 25, 2011 at 7:28 PM, Isaac Morland <ijmor...@uwaterloo.ca> wrote:
> > On Thu, 25 Aug 2011, Guido van Rossum wrote:
> >
> >> I'm not sure what should happen with UTF-8 when it (in flagrant
> >> violation of the standard, I presume) contains two separately-encoded
> >> surrogates forming a valid surrogate pair; probably whatever the UTF-8
> >> codec does on a wide build today should be good enough.

Surrogates are used and valid only in UTF-16. In UTF-8/32 they are invalid, even if they come in pairs (see http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf ). Of course, Python can and should be able to represent them internally regardless of the build type.

> >> Similarly for
> >> encoding to UTF-8 on a wide build if one managed to create a string
> >> containing a surrogate pair. Basically, I'm for a
> >> garbage-in-garbage-out approach (with separate library functions to
> >> detect garbage if the app is worried about it).
> >
> > If it's called UTF-8, there is no decision to be taken as to decoder
> > behaviour - any byte sequence not permitted by the Unicode standard must
> > result in an error (although, of course, *how* the error is to be reported
> > could legitimately be the subject of endless discussion).

What do you mean? We use the "strict" error handler by default and we can specify other handlers already.

> > There are
> > security implications to violating the standard so this isn't just
> > legalistic purity.
>
> You have a point. The security issues cannot be seen separate from all
> the other issues. The folks inside Google who care about Unicode often
> harp on this. So I stand corrected. I am fine with codecs treating
> code points or code point sequences that the Unicode standard doesn't
> like (e.g. lone surrogates) the same way as more severe errors in the
> encoded bytes (lots of byte sequences already aren't valid UTF-8).

Codecs that use the official names should stick to the standards.
For example, s.encode('utf-32') should either produce a valid UTF-32 byte string or raise an error if 's' contains invalid characters (e.g. surrogates). We can have other internal codecs that are based on the UTF-* encodings but allow the representation of lone surrogates, and even expose them if we want, but they should have a different name (even 'utf-*-something' should be OK; see http://bugs.python.org/issue12729#msg142053 starting from "Unicode says you can't put surrogates or noncharacters in a UTF-anything stream.").

> I just hope this doesn't require normal forms or other expensive
> operations; I hope it's limited to rejecting invalid use of surrogates
> or other values that are not valid code points (e.g. 0, or >= 2**21).

I think there shouldn't be any normalization done automatically by the codecs.

> > Hmmm, doesn't look good:
> >
> > Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
> > [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
> > Type "help", "copyright", "credits" or "license" for more information.
> > >>>
> > >>> '\xed\xb0\x80'.decode ('utf-8')
> > u'\udc00'
> > >>>
> >
> > Incorrect! Although this is a narrow build - I can't say what the wide
> > build would do.

The UTF-8 codec used to follow RFC 2279 and was only recently updated to RFC 3629 (see http://bugs.python.org/issue8271#msg107074 ). On Python 2.x it still produces invalid UTF-8 because changing it would be backward incompatible. In Python 2, UTF-8 can be used to encode every code point from 0 to 10FFFF, and it always works. If we changed it now, it might start raising errors for an operation that never raised them before (see http://bugs.python.org/issue12729#msg142047 ). Luckily this is fixed in Python 3.x.
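As a rough illustration of the Python 3 behaviour, here is a small sketch using the same three bytes as the session quoted above. It assumes a reasonably recent Python 3 (3.4 or later, since I am assuming the UTF-32 encoder rejects surrogates there as well); the names raises, raw and lone are made up for this example and are not part of any API.

```python
def raises(exc_type, func, *args):
    """Return True if func(*args) raises exc_type, False otherwise."""
    try:
        func(*args)
    except exc_type:
        return True
    return False

# U+DC00 (a lone low surrogate) encoded as if it were an ordinary code point.
raw = b'\xed\xb0\x80'
lone = '\udc00'

# With the default "strict" handler, the official codec names reject surrogates.
strict_decode_rejected = raises(UnicodeDecodeError, raw.decode, 'utf-8')
strict_utf8_rejected = raises(UnicodeEncodeError, lone.encode, 'utf-8')
strict_utf32_rejected = raises(UnicodeEncodeError, lone.encode, 'utf-32')  # assumes 3.4+

# Other handlers can be requested explicitly.
replaced = raw.decode('utf-8', 'replace')            # invalid bytes become U+FFFD
passed = raw.decode('utf-8', 'surrogatepass')        # opt in to lone surrogates
round_trip = passed.encode('utf-8', 'surrogatepass')

print(strict_decode_rejected, strict_utf8_rejected, strict_utf32_rejected)
print(ascii(replaced), ascii(passed), round_trip)
```

The point of the sketch is only that the lenient behaviour has to be asked for by name ('surrogatepass' here), while the plain 'utf-8' and 'utf-32' names stay strict.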
I think there are more codepoints/byte sequences that should be rejected while encoding/decoding though, in both UTF-8 and UTF-16/32, but I haven't looked at them yet (I would be happy to fix these for 3.3 or even 2.7/3.2 (if applicable), so if you find mismatches with the Unicode standard and report an issue, feel free to assign it to me).

Best Regards,
Ezio Melotti

> > For reasons of practicality, it may be appropriate to provide easy access to
> > a CESU-8 decoder in addition to the normal UTF-8 decoder, but it must not be
> > called UTF-8. Other variations may also find use if provided.
> >
> > See UTF-8 RFC: http://www.ietf.org/rfc/rfc3629.txt
> >
> > And CESU-8 technical report: http://www.unicode.org/reports/tr26/
>
> Thanks for the links! I also like the term "supplemental character" (a
> code point >= 2**16). And I note that they talk about characters were
> we've just agreed that we should say code points...
>
> --
> --Guido van Rossum (python.org/~guido)
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com