Change r57490 by Gregory P. Smith broke a test in test_unicodedata and, on PPC OS X, several tests in test_hashlib.
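To give a feel for the failures: the digest computed over a string's UTF-8 bytes and the one computed over its raw internal bytes simply don't match. A minimal Python-level sketch (md5 is an arbitrary choice of digest, and utf-16-le stands in for the internal representation, which really depends on the build and the platform's byte order):

    import hashlib

    s = "caf\u00e9"  # any non-ASCII text will do

    # What 's#' used to feed the digest code: the UTF-8 encoding.
    utf8_digest = hashlib.md5(s.encode("utf-8")).hexdigest()

    # What the raw buffer hands over instead: the internal wide
    # representation (utf-16-le here as a stand-in).
    raw_digest = hashlib.md5(s.encode("utf-16-le")).hexdigest()

    assert utf8_digest != raw_digest  # hence the failing tests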
Looking into this, it's pretty clear *why* it broke: before, the 's#' format code was used, while Gregory's change switched this to the buffer API (to ensure the data won't move around). Now, when a (Unicode) string is passed to 's#', it uses the UTF-8 encoding. But the buffer API uses the raw bytes in the Unicode object, which are typically UTF-16 or UTF-32. (I can't quite figure out why the tests didn't fail on my Linux box; I'm guessing it's an endianness issue, but it can't be that simple. Perhaps that box happens to be falling back on a different implementation of the checksums?)

I checked in a fix (because I don't like broken tests :-) which restores the old behavior by passing PyBUF_CHARACTER to PyObject_GetBuffer(), which enables a special case in the buffer API for PyUnicode that returns the UTF-8 encoded bytes instead of the raw bytes. (I still find this questionable, especially since a few random places in bytesobject.c also use PyBUF_CHARACTER, presumably to make tests pass; but for the *bytes* type, requesting *characters*, even encoded ones, is iffy.)

But I'm wondering whether passing a Unicode string to the various hash digest functions should work at all! Hashes are defined on sequences of bytes, and IMO we should insist that the user pass us bytes, not second-guess what to do with Unicode.

Opinions?

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
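P.S. Concretely, the "insist on bytes" position amounts to something like this at the Python level (a hypothetical sketch of the proposed behavior, not what hashlib currently does; strict_digest is made up for illustration):

    import hashlib

    def strict_digest(data, name="sha1"):
        # Hypothetical strict rule: hashes are defined on bytes,
        # so refuse str outright instead of guessing an encoding.
        if isinstance(data, str):
            raise TypeError("Unicode strings must be encoded before hashing")
        return hashlib.new(name, data).hexdigest()

    print(strict_digest("caf\u00e9".encode("utf-8")))  # caller picks the encoding

    try:
        strict_digest("caf\u00e9")  # no implicit encoding, no second-guessing
    except TypeError as exc:
        print(exc)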