I'm in favor of not allowing unicode for hash functions. A hash whose result depends on the system default encoding will not be portable.
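To make that concrete, here's a minimal sketch (assuming OpenSSL's libcrypto, compiled with -lcrypto; it's standalone and not taken from the hashlib code itself) showing that the digest of the "same" text is purely a function of whichever encoding gets picked:

    #include <stdio.h>
    #include <openssl/sha.h>

    /* Hash a byte buffer with SHA-1 and print the hex digest. */
    static void print_sha1(const unsigned char *data, size_t len)
    {
        unsigned char md[SHA_DIGEST_LENGTH];
        size_t i;

        SHA1(data, len, md);
        for (i = 0; i < SHA_DIGEST_LENGTH; i++)
            printf("%02x", md[i]);
        printf("\n");
    }

    int main(void)
    {
        /* The bytes of "café" in UTF-8 and in UTF-16-LE. */
        static const unsigned char utf8[]    = {0x63, 0x61, 0x66, 0xC3, 0xA9};
        static const unsigned char utf16le[] = {0x63, 0x00, 0x61, 0x00,
                                                0x66, 0x00, 0xE9, 0x00};

        print_sha1(utf8, sizeof utf8);       /* one digest...           */
        print_sha1(utf16le, sizeof utf16le); /* ...and a different one. */
        return 0;
    }

Two boxes that disagree on the "default" encoding will disagree on every digest, which is exactly the portability problem.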
Another question for hashlib: it uses PyArg_Parse to get a single 's' out of an optional parameter [see the code], and I couldn't figure out what the best thing to do there was. It just needs a C string to pass to OpenSSL to look up a hash function by name. It's C, so I doubt it'll ever be anything but ASCII. How should that parameter be parsed instead of with the old 's' string format? (There's a sketch of the spot I mean below the quoted mail.)

PyBUF_CHARACTER actually sounds ideal in that case, assuming it guarantees UTF-8, but I wasn't clear that it does. Is it always UTF-8, or is it the system "default encoding", which is possibly useless as far as APIs expecting C strings are concerned?

Requiring a bytes object would also work, but I really don't like the idea of users needing to use a specific type for something so simple. (I consider string constants with their preceding b, r, u, s type characters ugly in code without a good reason for them to be there.)

test_hashlib.py passed on the x86 OS X system I was using to write the code. I neglected to run the full suite, or to grep for hashlib in the other test suites and run those, so I missed the test_unicodedata failure. Sorry about the breakage.

Is it just me, or is a unicode object supporting the buffer API an odd concept? Buffer API consumers (as opposed to unicode consumers) shouldn't need to know about the encoding of the data they receive.

-gps

On 8/26/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> Change r57490 by Gregory P Smith broke a test in test_unicodedata and,
> on PPC OSX, several tests in test_hashlib.
>
> Looking into this it's pretty clear *why* it broke: before, the 's#'
> format code was used, while Gregory's change changed this into using
> the buffer API (to ensure the data won't move around). Now, when a
> (Unicode) string is passed to s#, it uses the UTF-8 encoding. But the
> buffer API uses the raw bytes in the Unicode object, which is
> typically UTF-16 or UTF-32. (I can't quite figure out why the tests
> didn't fail on my Linux box; I'm guessing it's an endianness issue,
> but it can't be that simple. Perhaps that box happens to be falling
> back on a different implementation of the checksums?)
>
> I checked in a fix (because I don't like broken tests :-) which
> restores the old behavior by passing PyBUF_CHARACTER to
> PyObject_GetBuffer(), which enables a special case in the buffer API
> for PyUnicode that returns the UTF-8 encoded bytes instead of the raw
> bytes. (I still find this questionable, especially since a few random
> places in bytesobject.c also use PyBUF_CHARACTER, presumably to make
> tests pass, but for the *bytes* type, requesting *characters* (even
> encoded ones) is iffy.)
>
> But I'm wondering if passing a Unicode string to the various hash
> digest functions should work at all! Hashes are defined on sequences
> of bytes, and IMO we should insist that the user pass us bytes, and
> not second-guess what to do with Unicode.
>
> Opinions?
>
> --
> --Guido van Rossum (home page: http://www.python.org/~guido/)
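P.S. For concreteness, here's a sketch of the parsing pattern I was asking about above. This is the shape of the code, not the exact _hashopenssl.c source, and lookup_digest is just an illustrative name:

    #include <Python.h>
    #include <openssl/evp.h>

    /* Turn the user-supplied algorithm name into an OpenSSL digest.
       All OpenSSL needs is a plain C string such as "sha1". */
    static const EVP_MD *
    lookup_digest(PyObject *name_obj)
    {
        const char *name;

        /* The old 's' format: hand back a C string, encoding a unicode
           object along the way.  The open question is what to use here
           instead, now that 's' and the buffer API have diverged. */
        if (!PyArg_Parse(name_obj, "s", &name))
            return NULL;

        /* Look the algorithm up by its ASCII name (this assumes the
           digest tables have already been loaded, e.g. via
           OpenSSL_add_all_digests()). */
        return EVP_get_digestbyname(name);
    }

Requiring bytes would simplify that spot, at the cost of every caller writing something like hashlib.new(b'sha1'), which is the ugliness I'd rather avoid.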