Tim Peters <t...@python.org> added the comment:
Some results of the "add perturb shifted left 1 instead" approach. These come from using an old pile of Python code I have that allows for easy investigation of different collision probe strategies. - As expected, because it has no fixed points, there are no bad cases in a dict/set with only two elements, _unless_ you're searching for a third distinct element. Ironically, if they all, e.g., have hash code -2, no problem at all: then 3 probes settle it. But if they're "random" hash codes (hashes of strings are good enough for testing this), then every now and again it can take up to 10 probes. It all depends on accidents of how the various bits get fed in across perturb shifts. However, with random hashes, searching for an element that's there never takes more than 2 probes in the new way (again because there's never a fixed point). Under the current way I've seen it take as many as 6. - On average, with random hashes and 2 elements in an 8-slot table, the average number of probes with a successful search fall from 1.07 to 1.06, and on an unsuccessful search from 1.33 to 1.30. Theoretically ideal ("uniform hashing" - each object visits the slots in its own random permutation of slot order) are 1.06/1.29. - Random hashes in an 8-slot table: average # probes for successful and failing searches 1 entry current 1.00 1.14 new 1.00 1.12 ideal 1.00 1.12 2 entries current 1.07 1.33 new 1.06 1.30 ideal 1.06 1.29 3 entries current 1.16 1.60 new 1.14 1.56 ideal 1.14 1.50 4 entries current 1.27 2.00 new 1.25 1.93 ideal 1.23 1.80 5 entries current 1.42 2.66 new 1.38 2.56 ideal 1.34 2.25 Note: those aren't typos ;-) For example, if there's only 1 entry, how can it take more than one probe for a failing search? Easy! The first probe hits the only entry, it doesn't match, and so it takes a second probe to be sure the new element isn't present. In a perfect world, that would happen 12.5% of the time (1 in 8). But in the presence of fixed points ("current"), it can take _more_ than 2. The most probes I saw in "current" was 6 (it hit the only entry 5 times in a row). - That account ends at 5 entries because we don't let an 8-slot table contain more than 5 entries. - Small as those differences are, they get smaller for larger tables. By the time we get to 1024 slots, three digits aren't enough to distinguish "current" from "ideal". Even at 256 slots, it's just 1-or-2 ULP wobble. So that's the good side: - Eliminates the "surprises" in this bug report. - For more realistic ordinary cases ("random" hashes), for small tables it buys small but very consistent reductions in average probes needed for both successful and failing searches. It also, in general (but not spelled out above), cuts the rare longest probe chains. The bad side: - For small tables we're talking about nanoseconds regardless. - That adding a new shift is "free" relies on that x86's LEA instruction supports implicitly shifting an addend, and on compilers exploiting that. - Some cases will get worse. That's just pragmatic fact. The intuition here is that `perturb` is _trying_ to get high-order hash bits into play, and "current" gets 5 more into play on each iteration. But "new" shifts perturb left 1 before adding, preventing them from having any effect at all on the last new bit. This can have seemingly chaotic effects. For example, with entries of the form i*1024 in a table with 2**13 slots and 5,461 entries ("full" - the most we allow), current 3.09 5.52 new 3.22 4.78 ideal 1.65 3.00 So the change hurts successful lookups but helps failing ones. For a full table with 16 slots, the reverse. For a full table with 2**10 slots, it hurts both. Bottom line? You tell me :-) It's just not compelling to me either way. If I were writing the code from scratch, I'd do it. But because of seemingly chaotic effects talked about near the end above, there's always a danger _some_ code important to someone will take a hit. It's just as true that their code will enjoy an unexpected speed boost, but nobody screams about those ;-) ---------- _______________________________________ Python tracker <rep...@bugs.python.org> <https://bugs.python.org/issue38105> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com