Dawid, could you explain how this is accessing the mahout trunk code? Is that part of the package step?
Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Wed, Jun 5, 2013 at 8:20 PM, Sean Owen <[email protected]> wrote: > On Wed, Jun 5, 2013 at 7:00 PM, Dawid Weiss > <[email protected]> wrote: > > 1) if you know your keys are 0...N-1 (dense) then there is typically no > > need to use a map; you'll do just as well with a regular lookup array :) > > Yeah of course... here I think there is no guarantee of this although > in many practical use cases the keys are drawn from a small range. For > example 1000 user IDs are often around the range 1 - 1000 or > something. > > > > Having written the above it made me wonder why Mahout's performance does > > not drop significantly on those malicious upper-bit changing benchmarks, > so > > I checked the code. The thing is Mahout (and I think Colt, originally) > uses > > double hashing with prime-sized tables (not 2^n-1 masking) -- this is > nice > > and indeed avoids the problems of clustering mentioned in pt. 2 above. > > Yes I thought this was why you could largely get away with it. If the > table is big enough, collisions and chains should not be a big deal to > begin with. > > Since you have this benchmark readily available -- what happens if you > un-comment one or both of the lines of code that mixes the hash? That > would really answer it. > > i'm curious as I have certainly made this assumption independently in > the custom hashtable implementations I made long ago (which also use > double hashing) and still use in the project. If it's not optimal > that's important info. >
