Isaac Morland, 26.08.2011 04:28:
On Thu, 25 Aug 2011, Guido van Rossum wrote:
I'm not sure what should happen with UTF-8 when it (in flagrant
violation of the standard, I presume) contains two separately-encoded
surrogates forming a valid surrogate pair; probably whatever the UTF-8
codec does on a wide build today should be good enough. Similarly for
encoding to UTF-8 on a wide build if one managed to create a string
containing a surrogate pair. Basically, I'm for a
garbage-in-garbage-out approach (with separate library functions to
detect garbage if the app is worried about it).
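As a rough illustration of the "separate library functions to detect garbage" idea, here is a minimal Python 3 sketch. The helper names `find_surrogates` and `is_well_formed` are hypothetical and do not exist in the standard library. It assumes a build where a str is a sequence of code points, so any code point in the range U+D800..U+DFFF counts as garbage, whether or not it sits next to a matching partner.

```python
def find_surrogates(text):
    """Return the indices of all surrogate code points in text.

    Hypothetical helper, not a standard-library function. A surrogate
    (U+D800..U+DFFF) is not a Unicode scalar value, so strict UTF-8
    cannot represent it.
    """
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


def is_well_formed(text):
    """Return True if text can be encoded with the strict UTF-8 codec."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True
```

An application that is worried about garbage could call `is_well_formed(s)` at its I/O boundary and use `find_surrogates(s)` to report where the offending code points are.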
If it's called UTF-8, there is no decision to be taken as to decoder
behaviour - any byte sequence not permitted by the Unicode standard must
result in an error (although, of course, *how* the error is to be reported
could legitimately be the subject of endless discussion). There are
security implications to violating the standard so this isn't just
legalistic purity.
Hmmm, doesn't look good:
Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> '\xed\xb0\x80'.decode ('utf-8')
u'\udc00'
>>>
Incorrect! Although this is a narrow build - I can't say what the wide
build would do.
Works the same for me in a wide Py2.7 build, but gives me this in Py3:
Python 3.1.2 (r312:79147, Sep 27 2010, 09:57:50)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> b'\xed\xb0\x80'.decode ('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: illegal encoding
Same for current Py3.3 and the PEP 393 build (although both now have a
better exception message: "UnicodeDecodeError: 'utf8' codec can't decode
bytes in position 0-1: invalid continuation byte").
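For anyone who wants to reproduce this on a Python 3 interpreter, the following sketch contrasts the strict default with the `surrogatepass` error handler, which is the explicit opt-in for the lenient behaviour. The variable names are illustrative only, and it assumes Python 3.1 or later, where `surrogatepass` is available.

```python
# The three bytes from the sessions above: U+DC00 encoded as if it
# were an ordinary code point, which strict UTF-8 does not permit.
data = b"\xed\xb0\x80"

# Strict decoding (the default error handler) rejects the sequence.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# 'surrogatepass' lets the surrogate through, matching what the
# Python 2 sessions above did by default.
lenient = data.decode("utf-8", "surrogatepass")

# The same handler is needed to encode the string back to bytes.
roundtrip = lenient.encode("utf-8", "surrogatepass")

print(strict_ok, ascii(lenient), roundtrip == data)
```

With this arrangement the caller has to ask for the lenient behaviour by name. The default stays standard-conforming.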
Stefan
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev