Hi Sean, > I don't know the history of that method, but I think there are reasons > why no mixing could be beneficial. I don't know if it ends up being > true in practice. For example, if you frequently store keys with > values 0..N-1 then in a big-enough map, and maybe even often access > the keys in order, then this tends to give you dense and linear access > to the table.
There are a few reasons. Theoretical/ motivational: 1) if you know your keys are 0...N-1 (dense) then there is typically no need to use a map; you'll do just as well with a regular lookup array :) 2) if your keys are sparsely distributed in 0...N-1 then there is a threat that you'll hit long chains of collisions for certain slot lookup implementations (linear or quadratic probing). This is in particular the case when the key is a compound of some sort of shifted values; like a long created from two integers where x = a << 32 | b. This isn't as uncommon as it may seem because it's convenient to concatenate several values into a single primitive than have the overhead of an indirect lookup or a full object. And then there are practical/ own experience considerations. 1) I thought no-rehashing would be fine in the early days of HPPC. It hurt me badly in terms of performance in real life applications (see pt. 2 above). 2) in practice rehashing is very cheap -- if you look at the benchmarks I posted recently you'll see that it adds almost no noticeable overhead (everything is within L1 cache or even inside registers). Having written the above it made me wonder why Mahout's performance does not drop significantly on those malicious upper-bit changing benchmarks, so I checked the code. The thing is Mahout (and I think Colt, originally) uses double hashing with prime-sized tables (not 2^n-1 masking) -- this is nice and indeed avoids the problems of clustering mentioned in pt. 2 above. So maybe you're right and this indeed isn't an issue given double hashing is the default slot-lookup technique. I'll leave it up to you to either close the issue as invalid or consider making the change nonetheless. Dawid > If the map is big enough that collisions are rare, then the negative > side effects you get from chaining are small and rare. > > I haven't thought deeply about it and don't have evidence it's better > or worse but there may be a reason. > > On Wed, Jun 5, 2013 at 1:49 PM, Dawid Weiss > <[email protected]> wrote: >>> But that's absolutely weird. The mixing function should take care of that. >> >>> But that's absolutely weird. The mixing function should take care of that. >> >> Sure, unless you don't have any... This is what's currently in Mahout >> (look closely at the first line!): >> >> public static int hash(int value) { >> return value; >> >> //return value * 0x278DDE6D; // see >> org.apache.mahout.math.jet.random.engine.DRand >> >> /* >> value &= 0x7FFFFFFF; // make it >=0 >> int hashCode = 0; >> do hashCode = 31*hashCode + value%10; >> while ((value /= 10) > 0); >> >> return 28629151*hashCode; // spread even further; h*31^5 >> */ >> } >> >> So there is no redistributing of keys. In my opinion this is a bug >> that should be addressed. >> >> Dawid >
