Tim Peters <[email protected]> added the comment:
Noting that the failure of high-order bits to propagate is one form of
"avalanche" failure. A recent 64-bit hash, SeaHash, addresses this with an
approach that is said to have provably "perfect" avalanche behavior:
Py_uhash_t t = (Py_uhash_t)y;  /* y is the current item's hash */
t *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;  /* propagate low-order bits left */
t ^= (t >> 32) >> (t >> 60);  /* high bits change low bits; variable shift */
t *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;  /* spread the altered bits left */
The giant constant is just a 63-bit "patternless" prime (for whatever reason,
the author wants this transformation to be easily invertible, which requires
an odd multiplier - a prime certainly qualifies - this is NOT a "crypto"
hash). The first multiply propagates low-order bits left. Then the next line
uses high-order bits to change low-order bits. Extracting a variable shift
count from the data itself is a clever idea taken from the PCG family of
PRNGs - you have to work to contrive data where this doesn't "act kinda
random". The last line then propagates the low-order bits - now altered by
the high-order bits - left again.
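As an aside, here's a minimal C sketch - a hypothetical helper of mine, not
SeaHash or CPython code - showing why an odd multiplier is enough: its
inverse modulo 2**64 can be computed by Newton iteration, and multiplying by
that inverse undoes the `t *= constant` step.

#include <stdint.h>

/* Hypothetical: multiplicative inverse of an odd 64-bit constant modulo
   2**64, via Newton iteration. Each pass doubles the number of correct
   low-order bits (3 -> 6 -> 12 -> 24 -> 48 -> 96), so 5 passes cover 64. */
static uint64_t
invert_odd_mod_2_64(uint64_t k)     /* k must be odd */
{
    uint64_t inv = k;               /* k*k == 1 (mod 8): 3 bits correct */
    for (int i = 0; i < 5; ++i)
        inv *= 2 - k * inv;
    return inv;
}

The xor-shift line is also reversible: the data-dependent shift count lives
in the top 4 bits, which that line never changes, so the shift count - and
then the original low bits - can be recovered from the output.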
Following the mixing above with
x = (x * mult) + t;
yields a tuple hash that passes all the tests I have. The only collisions
are in the new tuple test, which suffers 14.
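For concreteness, here's a rough sketch (mine, not code from any patch) of
how those pieces compose per item. The initial value of `x`, the constant
`mult`, and any final post-mixing are placeholders I haven't specified, and
it assumes a 64-bit build with CPython's Python.h providing the types:

/* Sketch only - initial x, mult, and any final mixing are placeholders. */
static Py_uhash_t
tuplehash_sketch(const Py_uhash_t *item_hashes, Py_ssize_t len,
                 Py_uhash_t x, Py_uhash_t mult)
{
    for (Py_ssize_t i = 0; i < len; i++) {
        Py_uhash_t t = item_hashes[i];
        t *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;  /* spread low bits left */
        t ^= (t >> 32) >> (t >> 60);             /* high bits -> low bits */
        t *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;  /* spread again */
        x = (x * mult) + t;                      /* fold into running hash */
    }
    return x;
}

The real loop would hash each element on the fly and bail out on errors; the
point here is just the order of operations per item.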
Better: add the "catastrophic" right-rotate
t = (t >> 3) | (t << 61);
at the start, and it's still only the new tuple test that has collisions - the
count rises to 19 then, close to (but still under) its failure cutoff.
What I haven't tried: in context it would be nice to eliminate the second
multiply by the giant constant. We're doing a multiply anyway to fold `t` into
`x`, and `t`'s low-order bits will get propagated left by the next iteration's
`x * mult`. That would ruin SeaHash's provable guarantees, but I'm more
concerned about saving some cycles than purity ;-) If that added enough
collisions to push the second tuple test "not much" over the limit, I'd raise
its limit.
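In code, that untried variant would amount to something like this per item -
again just a sketch of the idea, with no results behind it:

Py_uhash_t t = (Py_uhash_t)y;
t *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;  /* first multiply kept */
t ^= (t >> 32) >> (t >> 60);             /* high bits -> low bits */
x = (x * mult) + t;  /* second multiply dropped; the next iteration's
                        x * mult does the final leftward propagation */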
Gonzo: "the real" SeaHash duplicates the code above and works on 4 inputs
simultaneously, designed to keep a modern processor's instruction pipelines as
busy as possible. I'd rather not bother (most tuples are short).
So this is principled and, if the SeaHash theory is right, immune to any simple
kind of catastrophic failure. Is it worth it? I'd sleep better, yes ;-) Note
that the first multiply by the giant constant can be active at the same time as
the `x * mult` at the end, so I'm guessing the speed hit would be bearable.
There's no truly _cheap_ way to get good propagation from all bit positions;
SeaHash's is the fastest way to do that I've encountered.
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue34751>
_______________________________________