Tim Peters <[email protected]> added the comment:
Some results of the "add perturb shifted left 1 instead" approach. These come
from using an old pile of Python code I have that allows for easy investigation
of different collision probe strategies.
- As expected, because it has no fixed points, there are no bad cases in a
dict/set with only two elements, _unless_ you're searching for a third distinct
element. Ironically, if they all, e.g., have hash code -2, no problem at all:
then 3 probes settle it. But if they're "random" hash codes (hashes of strings
are good enough for testing this), then every now and again it can take up to
10 probes. It all depends on accidents of how the various bits get fed in
across perturb shifts. However, with random hashes, searching for an element
that's there never takes more than 2 probes in the new way (again because
there's never a fixed point). Under the current way I've seen it take as many
as 6.
- On average, with random hashes and 2 elements in an 8-slot table, the average
number of probes with a successful search fall from 1.07 to 1.06, and on an
unsuccessful search from 1.33 to 1.30. Theoretically ideal ("uniform hashing"
- each object visits the slots in its own random permutation of slot order) are
1.06/1.29.
- Random hashes in an 8-slot table: average # probes for successful and
failing searches
1 entry
current 1.00 1.14
new 1.00 1.12
ideal 1.00 1.12
2 entries
current 1.07 1.33
new 1.06 1.30
ideal 1.06 1.29
3 entries
current 1.16 1.60
new 1.14 1.56
ideal 1.14 1.50
4 entries
current 1.27 2.00
new 1.25 1.93
ideal 1.23 1.80
5 entries
current 1.42 2.66
new 1.38 2.56
ideal 1.34 2.25
Note: those aren't typos ;-) For example, if there's only 1 entry, how can it
take more than one probe for a failing search? Easy! The first probe hits the
only entry, it doesn't match, and so it takes a second probe to be sure the new
element isn't present. In a perfect world, that would happen 12.5% of the time
(1 in 8). But in the presence of fixed points ("current"), it can take _more_
than 2. The most probes I saw in "current" was 6 (it hit the only entry 5
times in a row).
- That account ends at 5 entries because we don't let an 8-slot table contain
more than 5 entries.
- Small as those differences are, they get smaller for larger tables. By the
time we get to 1024 slots, three digits aren't enough to distinguish "current"
from "ideal". Even at 256 slots, it's just 1-or-2 ULP wobble.
So that's the good side:
- Eliminates the "surprises" in this bug report.
- For more realistic ordinary cases ("random" hashes), for small tables it buys
small but very consistent reductions in average probes needed for both
successful and failing searches. It also, in general (but not spelled out
above), cuts the rare longest probe chains.
The bad side:
- For small tables we're talking about nanoseconds regardless.
- That adding a new shift is "free" relies on that x86's LEA instruction
supports implicitly shifting an addend, and on compilers exploiting that.
- Some cases will get worse. That's just pragmatic fact. The intuition here
is that `perturb` is _trying_ to get high-order hash bits into play, and
"current" gets 5 more into play on each iteration. But "new" shifts perturb
left 1 before adding, preventing them from having any effect at all on the last
new bit.
This can have seemingly chaotic effects. For example, with entries of the form
i*1024 in a table with 2**13 slots and 5,461 entries ("full" - the most we
allow),
current 3.09 5.52
new 3.22 4.78
ideal 1.65 3.00
So the change hurts successful lookups but helps failing ones. For a full
table with 16 slots, the reverse. For a full table with 2**10 slots, it hurts
both.
Bottom line? You tell me :-) It's just not compelling to me either way. If I
were writing the code from scratch, I'd do it. But because of seemingly
chaotic effects talked about near the end above, there's always a danger _some_
code important to someone will take a hit. It's just as true that their code
will enjoy an unexpected speed boost, but nobody screams about those ;-)
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue38105>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe:
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com