On Mar 22, 8:31 am, Bryan Olson <[EMAIL PROTECTED]> wrote: > [EMAIL PROTECTED] wrote: > > John Machin wrote: > >> On Mar 22, 1:11 am, [EMAIL PROTECTED] wrote: > >>> A collision sequence is not so rare. > >>>>>> [ hash( 2**i ) for i in range( 0, 256, 32 ) ] > >>> [1, 1, 1, 1, 1, 1, 1, 1] > >> Bryan did qualify his remarks: "If we exclude the case where an > >> adversary is choosing the keys, ..." > > > Some adversary. What, you mean, my boss or my customers? > > We mean that the party supplying the keys deliberately chose > them to make the hash table inefficient. In this thread the goal > is efficiency; a party working toward an opposing goal is an > adversary.
There are situations where this can happen I guess > If you find real-world data sets that tend to induce bad-case > behavior in Python's hash, please do tell. It would be reason > enough to adjust the hash function. The hashes in popular > software such as Python are already quite well vetted. As Python allows you to make your own hash functions for your own classes, another danger is that you are dealing with objects with a bad hash function. I've actually tried it (as I am *not* a pro!) and FWIW here are the results. -------- badhash.py ---------- class BadHash(object): def __hash__(self): return 1 # that's a bad hash function! from time import time def test(n): t0 = time() # Create a hash table of size n s = set(BadHash() for x in xrange(n)) t1 = time() # Add an object to it obj = BadHash() s.add(obj) t2 = time() # Find that object obj in s t3 = time() print "%s\t%.3e\t%.3e\t%.3e" % (n, (t1-t0)/n**2, (t2-t1)/n, (t3- t2)/n) print "n\tcreate/n^2\tadd/n\t\tfind/n" for k in range(8, 17): # I'm hoping that adding an element to a dict of size (1 << k) + 1 # will not trigger a resize of the hasmap. test((1 << k) + 1) -------------------------------- marigold:junk arno$ python badhash.py n create/n^2 add/n find/n 257 3.917e-08 6.958e-08 7.051e-08 513 3.641e-08 8.598e-08 7.018e-08 1025 3.522e-08 7.118e-08 6.722e-08 2049 3.486e-08 6.935e-08 6.982e-08 4097 3.480e-08 7.053e-08 6.931e-08 8193 3.477e-08 6.897e-08 6.981e-08 16385 3.441e-08 6.963e-08 7.075e-08 32769 3.720e-08 7.672e-08 7.885e-08 65537 3.680e-08 7.159e-08 7.510e-08 So theory and practice agree! In this case The hash table behaves just like a linked list. Lastly, if one deals with a totally ordered set of object but they are not hashable (there isn't a good hash function), then Ninereeds' idea of sorting first is still useful. -- Arnaud -- http://mail.python.org/mailman/listinfo/python-list