I'm trying to separate what a particular implementation happens to do on framework X from what is possible, and with what difficulty. I'm not convinced that this example, or others like it, shows that M/R is so bad. The existence of a slow M/R job doesn't prove there isn't a good one.
For example, the paper says: "We can observe that the Map function of a Hadoop ALS implementation performs no computation and its only purpose is to emit copies of the vertex data for every edge in the graph; unnecessarily multiplying the amount of data that need to be tracked. For example, a user vertex that connects to 100 movies must emit the data on the user vertex 100 times, once for each movie."

Yes, that's a bad way to do it, but I don't think that's even how the Mahout version does it. I thought it side-loaded one of the matrices rather than joining like this. That's "cheating" and takes work, but it's much faster than the above. Side-loading starts to make M/R workers into longer-lived entities holding a lot of data in memory, which is more like how other frameworks behave. Indeed, it helps a lot.

I've put a lot of work into an ALS implementation on M/R; it ought to be not far off optimal. So I tried benchmarking on a similar setup: Netflix data, 4 8-core machines, 20 features, etc. I got 178 seconds for one iteration, which is about as fast as the reported GraphLab time and nothing like the reported 9000 seconds for the Hadoop-based version. Maybe that figure measures something more than just an iteration; if I include all the initialization time (reading, preprocessing and such), I get about 800 seconds.

You can argue that the naive approach on M/R is the fair one to benchmark, and maybe that's what happened here. But this hardly seems to rule out M/R if a smarter approach, granted with a lot of honing, achieves similar speed. You can also say that the same level of optimization work would probably speed up another approach even further, and that's probably true. I'm still skeptical that M/R is unsuitable for many of these things, even if you can surely imagine a better framework for any given problem.
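To make the side-loading point concrete, here's a minimal sketch (my own illustration, not Mahout's actual code) of one ALS half-iteration where the full item-factor matrix Y is held in each worker's memory. The map over users then does the actual least-squares solve and emits one factor vector per user, instead of emitting a copy of the user data per edge. The function names and the regularization parameter `lam` are mine, purely for illustration:

```python
import numpy as np

def solve_user(ratings, Y, lam=0.1):
    """Solve one user's factor vector x from its observed (item, rating)
    pairs, given the item-factor matrix Y side-loaded into memory.
    Normal equations: (Yu^T Yu + lam * I) x = Yu^T r."""
    items = [i for i, _ in ratings]
    r = np.array([v for _, v in ratings])
    Yu = Y[items]                       # only the rows for this user's items
    k = Y.shape[1]
    A = Yu.T @ Yu + lam * np.eye(k)
    b = Yu.T @ r
    return np.linalg.solve(A, b)

def als_user_step(user_ratings, Y, lam=0.1):
    # The "map" over users: because every mapper already holds Y,
    # nothing is shuffled per edge -- one small vector out per user.
    return {u: solve_user(rs, Y, lam) for u, rs in user_ratings.items()}
```

The per-user solve is O(nnz_u * k^2 + k^3), and the only data movement is the broadcast of Y once per iteration, which is why a long-lived worker holding Y in memory beats re-emitting vertex data along every edge.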
On Mon, Mar 11, 2013 at 8:45 PM, Sebastian Schelter <[email protected]> wrote:
> The GraphLab guys benchmark their ALS implementation against an old
> version of ours and in detail describe why they can achieve a 40x to 60x
> performance improvement. Most of the overhead is attributed to Hadoop
> and its programming model.
>
> Its on the left column of Page 724 in
> http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf
>
> On 11.03.2013 21:35, Dmitriy Lyubimov wrote:
> > Exactly! I have always thought that was the main reason why ALS in Giraph
> > was faster.
> >
> > Doesn't it make strong case for a hybrid environment? Anyway, what i am
> > saying, isn't it more or less truthful to say that in pragmatic ways ALS
> > stuff in Mahout is lagging for the very reason of Mahout being constrained
> > to MR?
