On Tue, Mar 12, 2013 at 10:32 PM, Sebastian Schelter <[email protected]> wrote:

>
> It's not a bad way per se, it's a repartition join. Side-loading means
> doing a broadcast join, which is more performant but only works as long
> as the broadcasted side fits into the receiver's memory. You're right, we
> removed the repartition variant.
>

Yes, fitting in memory is an issue. You only need the vectors that are
touched by the input going to that particular reducer, and that set can be
known ahead of time. When the parallelism is pretty high, that subset is a
lot smaller than the total number of vectors.
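To make that concrete, here's a rough sketch of the idea only (not actual
Mahout or Myrrix code; the class and method names are made up): collect the
item IDs a partition's ratings actually touch, then side-load just those
factor vectors.

/*
 * Sketch only: given one partition's slice of the ratings, work out which
 * item factor vectors it touches and keep just those in memory.
 */
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import org.apache.mahout.math.Vector;

public class SideLoadSubset {

  /** Item IDs that appear anywhere in this partition's ratings. */
  public static Set<Integer> touchedItemIDs(Map<Integer,List<Integer>> ratingsByUser) {
    Set<Integer> ids = new HashSet<Integer>();
    for (List<Integer> itemIDs : ratingsByUser.values()) {
      ids.addAll(itemIDs);
    }
    return ids;
  }

  /** Only the factor vectors this partition actually needs in memory. */
  public static Map<Integer,Vector> neededFactors(Map<Integer,Vector> allItemFactors,
                                                  Set<Integer> touched) {
    Map<Integer,Vector> subset = new HashMap<Integer,Vector>(touched.size());
    for (Integer itemID : touched) {
      Vector v = allItemFactors.get(itemID);
      if (v != null) {
        subset.put(itemID, v);
      }
    }
    return subset;
  }
}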



>
> >
> > I've put a lot of work into an ALS implementation on M/R. It ought to be
> > not far off optimal. So I tried benchmarking on a similar setup --
> Netflix
> > data, 4 8-core machines, 20 features, etc.
>
> Interesting, can you share a few more details about what you did to
> achieve that performance?
>
>
(The cluster itself was just Amazon EMR, 4x m2.4xlarge, which are 8 core /
4 reducer machines)

Well, I'd imagine it's mostly the 'broadcast join' above; I can't think of
what else would make a 40x difference. I mean, the existing Mahout version
ought otherwise to be in the same ballpark too.
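For anyone following along, a hedged sketch of what I mean by the broadcast
join, under the simplifying assumption that ratings arrive as
"userID,itemID,rating" text lines; loadFactorsFromCache() is just a
placeholder for however the current factor matrix gets read locally, not a
real API:

/*
 * Sketch of a broadcast (map-side) join, not the actual implementation.
 * The whole item-factor matrix is loaded once per task in setup(), typically
 * from the DistributedCache, and each rating is joined in map() without
 * the factors ever going through the shuffle.
 */
import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BroadcastJoinMapper extends Mapper<LongWritable,Text,LongWritable,Text> {

  private Map<Long,float[]> itemFactors;  // small side, held in memory

  @Override
  protected void setup(Context context) throws IOException {
    // Read the broadcast side once per task; with JVM reuse this cost is
    // paid even less often than once per input split.
    itemFactors = loadFactorsFromCache();
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] tokens = value.toString().split(",");
    long userID = Long.parseLong(tokens[0]);
    long itemID = Long.parseLong(tokens[1]);
    float[] factors = itemFactors.get(itemID);  // the join happens here, in memory
    if (factors != null) {
      // emit whatever the solve step needs, keyed by user; no factor shuffle
      context.write(new LongWritable(userID), new Text(itemID + ',' + tokens[2]));
    }
  }

  private Map<Long,float[]> loadFactorsFromCache() throws IOException {
    // placeholder: deserialize the current factor matrix from local files
    throw new UnsupportedOperationException("sketch only");
  }
}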

There are a number of Hadoop settings that help here too, like block
compression for map and reduce output, reusing JVMs, OOB heartbeat,
speculative execution (on by default now?), sort factor and buffer size,
etc.
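For concreteness, these are roughly the knobs I mean, with Hadoop 1.x-era
property names (values are only illustrative; check them against your
distribution):

import org.apache.hadoop.conf.Configuration;

public class TuningExample {
  public static Configuration tunedConf() {
    Configuration conf = new Configuration();
    conf.setBoolean("mapred.compress.map.output", true);             // block-compress map output
    conf.setBoolean("mapred.output.compress", true);                 // compress reduce/job output
    conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);               // reuse JVMs within a job
    conf.setBoolean("mapreduce.tasktracker.outofband.heartbeat", true); // OOB heartbeat
    conf.setBoolean("mapred.map.tasks.speculative.execution", true);    // speculative execution
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
    conf.setInt("io.sort.factor", 100);                              // sort/merge factor
    conf.setInt("io.sort.mb", 200);                                  // sort buffer size (MB)
    return conf;
  }
}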



> I think the main argument of the authors is that their programming and
> execution model is a much better fit for the problem.
>
>
Yes, it's harder to do well on M/R, but not impossible, and that's a gap
that a good library or product can bridge. Here I don't think it's even that
hard to get "close" to a purpose-built solution on M/R. The upshot is being
compatible with people's existing distributed computing infrastructure. I
may be making a point everyone already agreed with.
