[issue38105] hash collision when hash(x) == -2 causes many calls to eq

Tim Peters Fri, 13 Sep 2019 20:20:31 -0700


Tim Peters <[email protected]> added the comment:


Some results of the "add perturb shifted left 1 instead" approach.  These come 
from using an old pile of Python code I have that allows for easy investigation 
of different collision probe strategies.

- As expected, because it has no fixed points, there are no bad cases in a 
dict/set with only two elements, _unless_ you're searching for a third distinct 
element.  Ironically, if they all, e.g., have hash code -2, no problem at all:  
then 3 probes settle it.  But if they're "random" hash codes (hashes of strings 
are good enough for testing this), then every now and again it can take up to 
10 probes.  It all depends on accidents of how the various bits get fed in 
across perturb shifts.  However, with random hashes, searching for an element 
that's there never takes more than 2 probes in the new way (again because 
there's never a fixed point).  Under the current way I've seen it take as many 
as 6.

- On average, with random hashes and 2 elements in an 8-slot table, the average 
number of probes with a successful search fall from 1.07 to 1.06, and on an 
unsuccessful search from 1.33 to 1.30.  Theoretically ideal ("uniform hashing" 
- each object visits the slots in its own random permutation of slot order) are 
1.06/1.29.

- Random hashes in an 8-slot table:  average # probes for successful and 
failing searches

1 entry
current 1.00 1.14
new     1.00 1.12
ideal   1.00 1.12

2 entries
current 1.07 1.33
new     1.06 1.30
ideal   1.06 1.29

3 entries
current 1.16 1.60
new     1.14 1.56
ideal   1.14 1.50

4 entries
current 1.27 2.00
new     1.25 1.93
ideal   1.23 1.80

5 entries
current 1.42 2.66
new     1.38 2.56
ideal   1.34 2.25 

Note:  those aren't typos ;-)  For example, if there's only 1 entry, how can it 
take more than one probe for a failing search?  Easy!  The first probe hits the 
only entry, it doesn't match, and so it takes a second probe to be sure the new 
element isn't present.  In a perfect world, that would happen 12.5% of the time 
(1 in 8).  But in the presence of fixed points ("current"), it can take _more_ 
than 2.  The most probes I saw in "current" was 6 (it hit the only entry 5 
times in a row).

- That account ends at 5 entries because we don't let an 8-slot table contain 
more than 5 entries.

- Small as those differences are, they get smaller for larger tables.  By the 
time we get to 1024 slots, three digits aren't enough to distinguish "current" 
from "ideal".  Even at 256 slots, it's just 1-or-2 ULP wobble.

So that's the good side:

- Eliminates the "surprises" in this bug report.

- For more realistic ordinary cases ("random" hashes), for small tables it buys 
small but very consistent reductions in average probes needed for both 
successful and failing searches.  It also, in general (but not spelled out 
above), cuts the rare longest probe chains.

The bad side:

- For small tables we're talking about nanoseconds regardless.

- That adding a new shift is "free" relies on that x86's LEA instruction 
supports implicitly shifting an addend, and on compilers exploiting that.

- Some cases will get worse.  That's just pragmatic fact.  The intuition here 
is that `perturb` is _trying_ to get high-order hash bits into play, and 
"current" gets 5 more into play on each iteration.  But "new" shifts perturb 
left 1 before adding, preventing them from having any effect at all on the last 
new bit.

This can have seemingly chaotic effects.  For example, with entries of the form 
i*1024 in a table with 2**13 slots and 5,461 entries ("full" - the most we 
allow),

current 3.09 5.52
new     3.22 4.78
ideal   1.65 3.00

So the change hurts successful lookups but helps failing ones.  For a full 
table with 16 slots, the reverse.  For a full table with 2**10 slots, it hurts 
both.

Bottom line?  You tell me :-)  It's just not compelling to me either way.  If I 
were writing the code from scratch, I'd do it.  But because of seemingly 
chaotic effects talked about near the end above, there's always a danger _some_ 
code important to someone will take a hit.  It's just as true that their code 
will enjoy an unexpected speed boost, but nobody screams about those ;-)

----------

_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue38105>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue38105] hash collision when hash(x) == -2 causes many calls to __eq__

Reply via email to

[issue38105] hash collision when hash(x) == -2 causes many calls to eq