On Tue, Mar 12, 2013 at 10:32 PM, Sebastian Schelter <[email protected]> wrote:
> It's not a bad way per se, it's a repartition join. Side-loading means
> doing a broadcast join, which is more performant but only works as long
> as the broadcasted side fits into the receiver's memory. You're right, we
> removed the repartition variant.

Yes, fitting in memory is an issue. You only need the vectors that are
touched by the input that goes to that particular reducer, and that can be
known ahead of time. When the parallelism is pretty high, that subset is a
lot smaller than the total number of vectors.

> > I've put a lot of work into an ALS implementation on M/R. It ought to be
> > not far off optimal. So I tried benchmarking on a similar setup --
> > Netflix data, 4 8-core machines, 20 features, etc.
>
> Interesting, can you share a few more details what you did to achieve
> that performance?

(The cluster itself was just Amazon EMR, 4x m2.4xlarge, which are 8-core /
4-reducer machines.)

Well, I think it's mostly the 'broadcast join' above, I'd imagine. I can't
think of what else makes a 40x difference. I mean, the existing Mahout
version has to be in that ballpark too. There are a number of Hadoop
settings that help here too, like block compression for map and reduce
output, reusing JVMs, OOB heartbeat, speculative execution (on by default
now?), sort factor and buffer size, etc. (I've pasted sketches of both
points at the bottom of this mail.)

> I think the main argument of the authors is that their programming and
> execution model is a much better fit for the problem.

Yes, it's harder to do well on M/R, not impossible, and that's a gap that a
good library or product can bridge. Here I don't think it's even that hard
to get "close" to a purpose-built solution on M/R? The upshot is being
compatible with people's existing distributed computing infrastructure. I
may be making a point everyone already agrees with.
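
Here's roughly what I mean by the broadcast join, as a minimal sketch
against the Hadoop 1.x / old Mahout math APIs. Class names and the
SequenceFile layout are just illustrative, not the actual Mahout job:

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Sketch of a map-side ("broadcast") join: the item-factor matrix is shipped
// to every task via the DistributedCache and held in memory, so the factors
// never go through the shuffle. Names and file layout are illustrative only.
public class BroadcastJoinMapper
    extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

  // side-loaded factor matrix, keyed by item ID
  private final Map<Integer, Vector> itemFactors = new HashMap<Integer, Vector>();

  @Override
  protected void setup(Context ctx) throws IOException, InterruptedException {
    Configuration conf = ctx.getConfiguration();
    // the driver side would have called something like:
    //   DistributedCache.addCacheFile(itemFactorsPath.toUri(), job.getConfiguration());
    Path[] cached = DistributedCache.getLocalCacheFiles(conf);
    FileSystem fs = FileSystem.getLocal(conf);
    for (Path file : cached) {
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
      IntWritable itemID = new IntWritable();
      VectorWritable factors = new VectorWritable();
      while (reader.next(itemID, factors)) {
        itemFactors.put(itemID.get(), factors.get().clone());
      }
      reader.close();
    }
  }

  @Override
  protected void map(IntWritable userID, VectorWritable ratings, Context ctx)
      throws IOException, InterruptedException {
    // the "join" is just an in-memory lookup; a real ALS step would solve the
    // regularized least-squares system for this user here instead of emitting
    Iterator<Vector.Element> it = ratings.get().iterateNonZero();
    while (it.hasNext()) {
      Vector.Element rating = it.next();
      Vector factors = itemFactors.get(rating.index());
      if (factors != null) {
        ctx.write(userID, new VectorWritable(factors));
      }
    }
  }
}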
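
And the sort of job settings I had in mind, using the Hadoop 1.x property
names; the values are just starting points I'd tune per cluster:

import org.apache.hadoop.conf.Configuration;

// The kind of Hadoop 1.x settings mentioned above. Property names are the
// 1.x ones; values are only illustrative starting points.
public final class JobTuning {

  private JobTuning() {}

  public static void apply(Configuration conf) {
    // compress map output, block-compress (SequenceFile) job output
    conf.setBoolean("mapred.compress.map.output", true);
    conf.setBoolean("mapred.output.compress", true);
    conf.set("mapred.output.compression.type", "BLOCK");
    // reuse JVMs across tasks of the same job
    conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
    // out-of-band heartbeats so finished tasks report back immediately
    conf.setBoolean("mapreduce.tasktracker.outofband.heartbeat", true);
    // speculative execution (may already be on by default)
    conf.setBoolean("mapred.map.tasks.speculative.execution", true);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
    // larger sort buffer and merge factor for the shuffle
    conf.setInt("io.sort.mb", 256);
    conf.setInt("io.sort.factor", 64);
  }
}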
