Tim Peters <t...@python.org> added the comment:

An "aha!" moment for me:  I noted before that replacing the tuple hash loop with

    Py_uhash_t t = (Py_uhash_t)y;
    t ^= t << 1;
    x = (x * mult) + t;

does OK (64-bit build):  most tests had no collisions, but the original tuple 
test had 24 (a modest failure) and the new one 6.

Horrible results with a different scheme prompted me to add one line before the 
shift-xor:

    t = (t >> 3) | (t << 61);

That is, merely rotate the input right by 3 bits first.  Disaster!  Almost all 
tests suffered major collisions, and the old test 235424 and the new one 344919.

What's going on?  With hindsight, it's obvious:  multiplication is a horrible 
"mixer" _in this context_.  Because nearly all the tests are slinging little 
integers, almost all the input variation is in the last handful of bits.  
Multiplication is great for spreading low-order bits around.  But rotate them 
to the high end, and multiplication is next to useless:  it only propagates 
them "to the left", and they're already at the left end then.  This part has 
virtually nothing to do with whether + or ^ is used, or with whether 
multiplication is done before or after.  It's actually the multiplication 
that's the weakness, and has nothing to do with which multiplier is used.

This isn't a problem with FNV or DJB when used _as intended_, because their 
output is much wider than their inputs.  The high-order bit of an input for 
them is still a lower-order bit to their much wider multiplication, and 
eventually propagates.  It isn't a problem for "multiplicative hashing" 
algorithms either, because those use a double-width multiply and only retain 
the _high_ half of a double-width product.  We're only retaining the _lower_ 
half.

I also noted before that replacing the shift-xor with the frozenset hash's 
input scrambler:

    t = ((t ^ 89869747UL) ^ (t << 16)) * 3644798167UL;

worked great.  But it's equally a disaster if the inputs are rotated right by 3 
first.  Same reason:  it too only propagates "to the left".

So almost all tuple hash schemes on the table, both current and various 
experiments here, are systematic disasters if input hashes vary mostly in the 
high-order bits.  We haven't noticed because the current string hashes vary 
about the same in all bit positions, while contiguous ranges of not-huge ints 
have hashes that vary in the low-order bits.

The only schemes in all these messages that are "obviously immune" are those 
that used a (any) flavor of FNV or DJB as intended, treating each input hash as 
a sequence of unsigned bytes.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue34751>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

Reply via email to