On Wednesday, 19 November 2014 at 02:16:36 UTC, Steven Schveighoffer wrote:
From all I remember about hash tables, you are supposed to minimize the number of collisions when adding/looking up data. This helps keep the O(1) lookup property intact.
That is the theory, but the real world is sometime different. It depends on how you resolve collisions. If you do it via some kind of linked list, then yes, you must remove all collisions. But if you do it by putting the element in a slot close to its actual position, then thing gets different. In that case, you don't want to minimize collision, but to minimize the distance to the object, in order for it to be in the cache line. After all, unless your hash table is very small, the first hit is most likely a cache miss, meaning ~300 cycles. at this point the cache line is in L1, with a 3 cycle access time. So accessing several element in the same cache line is often preferable. You may want to read about robin hood hash table for a simple example of this. These hash tables are fast even with a very high collision rate.
