Two bits of new info: first, it's possible to get the performance of "double" without division, at least this way:
""" # Double hashing using Fibonacci multiplication for the increment. This # does about as well as `double`, but doesn't require division. # # The multiplicative constant depends on the word size W, and is the # nearest odd integer to 2**W/((1 + sqrt(5))/2). So for a 64-bit box: # # >>> 2**64 / ((1 + decimal.getcontext().sqrt(5))/2) # Decimal('11400714819323198485.95161059') # # For a 32-bit box, it's 2654435769. # # The result is the high-order `nbits` bits of the low-order W bits of # the product. In C, the "& M" part isn't needed (unsigned * in C # returns only the low-order bits to begin with). # # Annoyance: I don't think Python dicts store `nbits`, just 2**nbits. def dfib(h, nbits, M=M64): mask = (1 << nbits) - 1 i = h & mask yield i inc = (((h * 11400714819323198485) & M) >> (64 - nbits)) | 1 while True: i = (i + inc) & mask yield i """ Second, the program I posted uses string objects as test cases. The current string hash acts like a good-quality pseudo-random number generator: change a character in the string, and the hash typically changes "a lot". It's also important to look at hashes with common patterns, because, e.g., all "sufficiently small" integers are their _own_ hash codes (e.g., hash(3) == 3). Indeed, long ago Python changed its dict implementation to avoid quadratic-time behavior for unfortunate sets of real-life integer keys. Example: change `objgen()` to `yield i << 12` instead of `yield str(i)`. Then we generate integers all of whose final 12 bits are zeroes. For all sufficiently small tables, then, these ints _all_ map to the same initial table slot (since the initial probe is taken from the last `nbits` bits). The collision resolution scheme is needed to prevent disaster. 
Here's a chunk of output from that, for dicts of size 2,730:

"""
bits 12 nslots 4,096 dlen 2,730 alpha 0.67 # built 37
theoretical avg probes for uniform hashing when found 1.65 not found 3.00
crisp when found 1.65 not found 3.00
prober linear
    found min 1:0.04% max 2730 mean 1365.50
    fail  min 2731:100.00% max 2731 mean 2731.00
prober quadratic
    found min 1:0.04% max 2730 mean 1365.50
    fail  min 2731:100.00% max 2731 mean 2731.00
prober pre28201
    found min 1:0.04% max 61 mean 6.17
    fail  min 5:28.94% max 68 mean 9.78
prober current
    found min 1:0.04% max 58 mean 5.17
    fail  min 4:29.30% max 70 mean 8.62
prober double
    found min 1:0.04% max 5 mean 2.73
    fail  min 2:41.47% max 9 mean 3.94
prober dfib
    found min 1:0.04% max 9 mean 2.53
    fail  min 2:10.87% max 17 mean 4.48
prober uniform
    found min 1:66.52% max 21 mean 1.65
    fail  min 1:33.41% max 30 mean 2.99
"""

It's a worst-case disaster for linear and quadratic probing.  The `current` scheme does fine, but `double` and `dfib` do significantly better on all measures shown.  `uniform` is oblivious, still achieving its theoretical average-case performance.

That last isn't surprising "in theory", since it theoretically picks a probe sequence uniformly at random from the set of all table slot permutations.  It's perhaps a bit surprising that this approximate _implementation_ also achieves it.  But `random.seed(h)` does a relatively enormous amount of work to spray the bits of `h` all over the Twister's internal state, and then shuffle them around.

In any case, the "double hashing" variants (`double` and `dfib`) look very promising, doing better than the current scheme for both small tables and pathological key sets.

_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/