On 8/26/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> But I'm wondering if passing a Unicode string to the various hash
> digest functions should work at all! Hashes are defined on sequences
> of bytes, and IMO we should insist on the user to pass us bytes, and
> not second-guess what to do with Unicode.
Conceptually, unicode *by itself* can't be represented as a buffer. What
can be represented is a unicode string plus an encoding. The question is
whether the hash function needs to know the encoding to compute the hash.

If you're hashing arbitrary bytes, it doesn't really matter -- there is
no expectation that a recoding should have the same hash. For hashing as
a shortcut to __ne__, it does matter for text. Unfortunately, for
historical reasons, plenty of code grabs the string buffer expecting
text.

For dict comparisons, we really ought to specify equality (and therefore
the hash) in terms of a canonical equivalent, encoded in some encoding
X. (It isn't clear to me that X should be UTF-8 in particular; the main
thing is to pick something.)

The alternative is that defensive code will need to do a (normally
useless, boilerplate) decode/canonicalize/reencode dance before
dictionary checks and insertions. I would rather see that boilerplate
done once in the unicode type (and again in any equivalent types, if
need be), because

(1) most storage types/encodings would be able to take shortcuts, and
(2) if people don't do the defensive coding, the bugs will be very
    obscure.

-jJ
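P.S. A quick sketch of the two failure modes, in Python 3 terms
(hashlib and unicodedata here are just illustrative stand-ins for
whatever the canonical machinery would actually be):

    import hashlib
    import unicodedata

    text = "café"

    # 1. The hash depends on the encoding: the "same" text hashes
    #    differently as UTF-8 and as Latin-1.
    utf8_digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    latin1_digest = hashlib.sha1(text.encode("latin-1")).hexdigest()
    assert utf8_digest != latin1_digest

    # 2. Even within one encoding, canonically equivalent strings can
    #    have different byte sequences: U+00E9 vs. U+0065 U+0301.
    composed = "caf\u00e9"        # 'é' as one code point (NFC form)
    decomposed = "cafe\u0301"     # 'e' plus combining acute (NFD form)
    assert composed != decomposed  # unequal as code-point sequences
    assert (unicodedata.normalize("NFC", composed)
            == unicodedata.normalize("NFC", decomposed))

    # The decode/canonicalize/reencode dance: normalize to a canonical
    # form before hashing, so equivalent text hashes the same.
    def canonical_digest(s):
        normalized = unicodedata.normalize("NFC", s)
        return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

    assert canonical_digest(composed) == canonical_digest(decomposed)

The point of doing that once in the unicode type is exactly so user
code never has to write canonical_digest by hand.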