Never mind. I found the checked in jar in the lib folder. Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
On Thu, Jun 6, 2013 at 3:13 AM, Robin Anil <[email protected]> wrote: > Dawid, could you explain how this is accessing the mahout trunk code? Is > that part of the package step? > > Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. > > > On Wed, Jun 5, 2013 at 8:20 PM, Sean Owen <[email protected]> wrote: > >> On Wed, Jun 5, 2013 at 7:00 PM, Dawid Weiss >> <[email protected]> wrote: >> > 1) if you know your keys are 0...N-1 (dense) then there is typically no >> > need to use a map; you'll do just as well with a regular lookup array :) >> >> Yeah of course... here I think there is no guarantee of this although >> in many practical use cases the keys are drawn from a small range. For >> example 1000 user IDs are often around the range 1 - 1000 or >> something. >> >> >> > Having written the above it made me wonder why Mahout's performance does >> > not drop significantly on those malicious upper-bit changing >> benchmarks, so >> > I checked the code. The thing is Mahout (and I think Colt, originally) >> uses >> > double hashing with prime-sized tables (not 2^n-1 masking) -- this is >> nice >> > and indeed avoids the problems of clustering mentioned in pt. 2 above. >> >> Yes I thought this was why you could largely get away with it. If the >> table is big enough, collisions and chains should not be a big deal to >> begin with. >> >> Since you have this benchmark readily available -- what happens if you >> un-comment one or both of the lines of code that mixes the hash? That >> would really answer it. >> >> i'm curious as I have certainly made this assumption independently in >> the custom hashtable implementations I made long ago (which also use >> double hashing) and still use in the project. If it's not optimal >> that's important info. >> > >
