Re: Discussion Of ML environment/MR, Mahout

Sean Owen Thu, 14 Mar 2013 09:02:46 -0700

On Wed, Mar 13, 2013 at 7:41 PM, Sebastian Schelter <[email protected]> wrote:


> Hadoop has to reschedule every iteration as a separate job, reread the
> input data from disk and write the iteration's result to HDFS. In fact an
> ALS iteration always does each of these things twice, as it needs two M/R
> jobs. GraphLab/Giraph/Stratosphere on the other hand have to do none
> of these three things (GraphLab doesn't even do synchronous iterations)
> and I highly doubt that a Hadoop implementation can reach comparable
> performance.
>

That's all true, but would you imagine I/O is 97.5% of the run-time? That
is what a 40x slowdown would imply. A 100-feature vector is 400 bytes, but
to compute an update you need to invert a 100x100 matrix. I can't see
reading and writing the former taking 40x longer than computing the latter.
That's why I bet you'll find the current implementation is nothing like 40x
slower.
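
To make the arithmetic behind that estimate concrete, here is a rough
back-of-envelope sketch. It assumes 4-byte floats (which is what 400 bytes
for 100 features implies) and uses the textbook leading-order cost of a
Cholesky factorization, about k^3/3 floating-point operations, as a stand-in
for the 100x100 solve. The function names are made up for illustration and
none of this is measured from Mahout.

```python
def overhead_fraction(slowdown: float) -> float:
    """Share of run-time that must be non-compute overhead if the job
    is `slowdown` times slower than the computation alone."""
    return 1.0 - 1.0 / slowdown


def vector_bytes(features: int, bytes_per_value: int = 4) -> int:
    """Size of one feature vector, assuming 4-byte floats by default."""
    return features * bytes_per_value


def cholesky_flops(k: int) -> float:
    """Leading-order flop count for factoring a k x k symmetric
    positive-definite matrix (assumption: about k^3 / 3)."""
    return k ** 3 / 3.0


if __name__ == "__main__":
    print(f"overhead implied by 40x: {overhead_fraction(40):.1%}")
    print(f"bytes per 100-feature vector: {vector_bytes(100)}")
    print(f"approx. flops per 100x100 solve: {cholesky_flops(100):,.0f}")
```

So each update moves a few hundred bytes but costs on the order of a few
hundred thousand floating-point operations, which is why it seems unlikely
that I/O alone accounts for 97.5% of the time.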

2x? Maybe. And 2x is nothing to sneeze at!
