On Wed, Jun 5, 2013 at 7:00 PM, Dawid Weiss <[email protected]> wrote: > 1) if you know your keys are 0...N-1 (dense) then there is typically no > need to use a map; you'll do just as well with a regular lookup array :)
Yeah of course... here I think there is no guarantee of this although in many practical use cases the keys are drawn from a small range. For example 1000 user IDs are often around the range 1 - 1000 or something. > Having written the above it made me wonder why Mahout's performance does > not drop significantly on those malicious upper-bit changing benchmarks, so > I checked the code. The thing is Mahout (and I think Colt, originally) uses > double hashing with prime-sized tables (not 2^n-1 masking) -- this is nice > and indeed avoids the problems of clustering mentioned in pt. 2 above. Yes I thought this was why you could largely get away with it. If the table is big enough, collisions and chains should not be a big deal to begin with. Since you have this benchmark readily available -- what happens if you un-comment one or both of the lines of code that mixes the hash? That would really answer it. i'm curious as I have certainly made this assumption independently in the custom hashtable implementations I made long ago (which also use double hashing) and still use in the project. If it's not optimal that's important info.
