On Nov 28, 1:24 pm, Scott David Daniels <[EMAIL PROTECTED]> wrote:
> Jeff H wrote:
> > hashlib.md5 does not appear to like unicode,
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\xa6' in
> > position 1650: ordinal not in range(128)
> >
> > After googling, I've found BDFL and others on Py3K talking about the
> > problems of hashing non-bytes (i.e. buffers) ...
>
> Unicode is characters, not a character encoding.
> You could hash on a utf-8 encoding of the Unicode.
>
> > So what is the canonical way to hash unicode?
> >   * convert unicode to local
> >   * hash in current local
> > ???
>
> There is no _the_ way to hash Unicode, any more than
> there is _the_ way to hash vectors. You need to
> convert the abstract entity to something concrete with
> a well-defined representation in bytes, and hash that.
>
> > Is this just a problem for md5 hashes that I would not encounter using
> > a different method? i.e. Should I just use the built-in hash function?
>
> No, it is a definitional problem. Perhaps you could explain how you
> want to use the hash. If the internal hash is acceptable (e.g. for
> grouping in dictionaries within a single run), use that. If you intend
> to store and compare on the same system, say that. If you want cross-
> platform execution of your code to produce the same hashes, say that.
> A hash is a means to an end, and it is hard to give advice without
> knowing the goal.

I am checking for changes to large text objects stored in a database
against outside sources, so the hash needs to be reproducible/stable.
> --Scott David Daniels
> [EMAIL PROTECTED]
--
http://mail.python.org/mailman/listinfo/python-list
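For what it's worth, a minimal sketch of the approach Scott suggests: explicitly encode the Unicode text to UTF-8 (a well-defined byte representation) before feeding it to hashlib, so the digest is stable across runs and platforms. The function name `stable_digest` is just an illustration, not anything from the stdlib.

```python
import hashlib

def stable_digest(text):
    # Encode the Unicode text to a fixed, well-defined byte
    # representation (UTF-8) before hashing, so the MD5 digest is
    # reproducible across runs, machines, and locale settings.
    return hashlib.md5(text.encode('utf-8')).hexdigest()

# Characters outside ASCII, like u'\xa6', no longer raise
# UnicodeEncodeError, because we never rely on the implicit
# ASCII encoding step that triggered the original error.
digest = stable_digest(u'broken bar: \xa6')
```

Note that this only gives stable results as long as you always pick the same encoding; hashing the UTF-8 bytes and the UTF-16 bytes of the same text produces different digests, so the encoding choice is part of your hash's definition.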