Tim Peters <t...@python.org> added the comment:

Here's htest.py output from

git pull https://github.com/jdemeyer/cpython.git bpo34751

and a variation.  The number of collisions in the variation appears in square 
brackets at the end of each line.

32-bit build:

range(100) by 3; 32-bit hash codes; mean 116.42 got 0 [0]
-10 .. 8 by 4; 32-bit hash codes; mean 1.28 got 0 [0]
-50 .. 50 less -1 by 3; 32-bit hash codes; mean 116.42 got 0 [0]
0..99 << 60 by 3; 32-bit hash codes; mean 116.42 got 0 [0]
[-3, 3] by 20; 32-bit hash codes; mean 128.00 got 140 [108]
[0.5, 0.25] by 20; 32-bit hash codes; mean 128.00 got 99 [154]
old tuple test; 32-bit hash codes; mean 7.43 got 4 [2]
new tuple test; 32-bit hash codes; mean 13.87 got 19 [21]

64-bit build:

range(100) by 3; 64-bit hash codes; mean 0.00 got 0 [0]
range(100) by 3; 32-bit lower hash codes; mean 116.42 got 130 [0]
range(100) by 3; 32-bit upper hash codes; mean 116.42 got 94 [6]
-10 .. 8 by 4; 64-bit hash codes; mean 0.00 got 0 [0]
-10 .. 8 by 4; 32-bit lower hash codes; mean 1.28 got 1 [0]
-10 .. 8 by 4; 32-bit upper hash codes; mean 1.28 got 1 [0]
-50 .. 50 less -1 by 3; 64-bit hash codes; mean 0.00 got 0 [0]
-50 .. 50 less -1 by 3; 32-bit lower hash codes; mean 116.42 got 128 [0]
-50 .. 50 less -1 by 3; 32-bit upper hash codes; mean 116.42 got 116 [8]
0..99 << 60 by 3; 64-bit hash codes; mean 0.00 got 0 [0]
0..99 << 60 by 3; 32-bit lower hash codes; mean 116.42 got 123 [258]
0..99 << 60 by 3; 32-bit upper hash codes; mean 116.42 got 121 [0]
[-3, 3] by 20; 64-bit hash codes; mean 0.00 got 0 [0]
[-3, 3] by 20; 32-bit lower hash codes; mean 128.00 got 129 [117]
[-3, 3] by 20; 32-bit upper hash codes; mean 128.00 got 137 [115]
[0.5, 0.25] by 20; 64-bit hash codes; mean 0.00 got 0 [0]
[0.5, 0.25] by 20; 32-bit lower hash codes; mean 128.00 got 126 [131]
[0.5, 0.25] by 20; 32-bit upper hash codes; mean 128.00 got 137 [130]
old tuple test; 64-bit hash codes; mean 0.00 got 0 [0]
old tuple test; 32-bit lower hash codes; mean 7.43 got 12 [5]
old tuple test; 32-bit upper hash codes; mean 7.43 got 54 [52]
new tuple test; 64-bit hash codes; mean 0.00 got 0 [0]
new tuple test; 32-bit lower hash codes; mean 13.87 got 10 [6]
new tuple test; 32-bit upper hash codes; mean 13.87 got 20 [30]

They're all fine by me.  This is what the variation does:

1. Removes all "final mix" code after the loop ends.
2. Changes initialization to add in the length:

    Py_uhash_t acc = _PyHASH_XXPRIME_5 + (Py_uhash_t)len;

The vast bulk of the "final mix" code applies a chain of tuple-agnostic 
permutations.  The only exception is adding in the length, which #2 moves to 
the start instead.
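For concreteness, here's a rough Python sketch of the variation's core loop, using what I believe are the xxHash64 prime constants CPython names _PyHASH_XXPRIME_1/2/5 and a 31-bit rotate; 64-bit wraparound is simulated with a mask (treat this as illustrative, not the actual C code):

```python
# Sketch of the variation: xxHash64-style accumulator loop with the
# final mix removed and the length folded into the initializer.
XXPRIME_1 = 11400714785074694791   # xxHash64 prime 1 (assumed constant)
XXPRIME_2 = 14029467366897019727   # xxHash64 prime 2 (assumed constant)
XXPRIME_5 = 2870177450012600261    # xxHash64 prime 5 (assumed constant)
MASK = (1 << 64) - 1               # simulate 64-bit Py_uhash_t wraparound

def rotl(x, r):
    """Rotate a 64-bit value left by r bits."""
    return ((x << r) | (x >> (64 - r))) & MASK

def tuplehash_variant(item_hashes):
    # Change #2: fold the length into the initial accumulator.
    acc = (XXPRIME_5 + len(item_hashes)) & MASK
    for h in item_hashes:
        acc = (acc + h * XXPRIME_2) & MASK
        acc = rotl(acc, 31)
        acc = (acc * XXPRIME_1) & MASK
    # Change #1: no "final mix" -- return the accumulator directly.
    return acc
```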

Applying permutations after the loop ends changes nothing about the number of 
full-width hash collisions, which is Python's _primary_ concern.  Two 
full-width hash codes are the same after the final mix if and only if they're 
the same before the final mix starts.
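A toy illustration of that point (not CPython code): any bijective post-loop mix sends equal inputs to equal outputs and distinct inputs to distinct outputs, so the full-width collision count can't change.

```python
# A made-up but invertible xorshift/multiply finalizer of the kind
# 64-bit hash finalizers use: xor-by-shift is invertible, and the
# multiplier is odd, hence invertible mod 2**64.
MASK = (1 << 64) - 1

def final_mix(x):
    x ^= x >> 33
    x = (x * 0xff51afd7ed558ccd) & MASK  # arbitrary odd 64-bit constant
    x ^= x >> 29
    return x

def count_collisions(vals):
    return len(vals) - len(set(vals))

# Collisions among full-width values are identical before and after
# the mix, because the mix is a bijection on 64-bit values.
hashes = [0, 1, 1, 2**40, 2**40]
assert count_collisions(hashes) == \
       count_collisions([final_mix(h) for h in hashes])
```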

In xxHash they're concerned about getting close to "avalanche" perfection (each 
input bit flips each final output bit with probability about 0.5).  
There aren't enough permutations _inside_ the loop to achieve that for the last 
input or two, so they pile up more permutations after the loop.

While we do care about "propagate left" and "propagate right" to mix up very 
regular and/or sparse hash codes, "avalanche perfection" is of no particular 
value to us.  To the contrary, in _typical_ cases like `product(range(100), 
repeat=3)` it's better for us _not_ to destroy all the regularity:

range(100) by 3; 32-bit lower hash codes; mean 116.42 got 130 [0]
range(100) by 3; 32-bit upper hash codes; mean 116.42 got 94 [6]

"Avalanche perfection" is what drives the number of collisions up with the 
final mix code intact.  Without it, we get "waaaaaaay better than random" 
numbers of collisions.

The other reason dropping the final mix is attractive:  that code is one long 
critical path (no instruction-level parallelism), adding 8 serialized machine 
instructions including two multiplies.  That's a real expense for 
typically-short tuples used as dict keys or set elements.

Nothing is anywhere near disaster territory either way, although "worse than 
random" remains quite possible, especially when cutting a 64-bit hash in half 
(do note that, either way, there are no collisions in any test when retaining 
the full 64-bit hash code).

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue34751>
_______________________________________