On Wed, Jun 5, 2013 at 7:00 PM, Dawid Weiss
<[email protected]> wrote:
> 1) if you know your keys are 0...N-1 (dense) then there is typically no
> need to use a map; you'll do just as well with a regular lookup array :)

Yeah of course... here I think there is no guarantee of this although
in many practical use cases the keys are drawn from a small range. For
example 1000 user IDs are often around the range 1 - 1000 or
something.


> Having written the above it made me wonder why Mahout's performance does
> not drop significantly on those malicious upper-bit changing benchmarks, so
> I checked the code. The thing is Mahout (and I think Colt, originally) uses
> double hashing with prime-sized tables (not 2^n-1 masking) -- this is nice
> and indeed avoids the problems of clustering mentioned in pt. 2 above.

Yes I thought this was why you could largely get away with it. If the
table is big enough, collisions and chains should not be a big deal to
begin with.

Since you have this benchmark readily available -- what happens if you
un-comment one or both of the lines of code that mixes the hash? That
would really answer it.

i'm curious as I have certainly made this assumption independently in
the custom hashtable implementations I made long ago (which also use
double hashing) and still use in the project. If it's not optimal
that's important info.

Reply via email to