On Wed, Mar 13, 2013 at 7:41 PM, Sebastian Schelter <[email protected]> wrote:
> Hadoop has to reschedule every iteration as separate job, reread the
> input data from disk and write the iterations result to HDFS. In fact an
> ALS iteration always includes twice of these things as it needs two M/R
> jobs. GraphLab/Giraph/Stratosphere on the other hand have to do neither
> of these three things (GraphLab even doesn't do synchronous iterations)
> and I highly doubt that a Hadoop implementation can get on par performance.
>

That's all true but would you imagine I/O is 97.5% of the run-time? A 100-feature vector is 400 bytes, but to compute an update you need to invert a 100x100 matrix. I can't see the former taking 40x longer than the latter. That's why I bet you'll find the current implementation is nothing like 40x slower. 2x? maybe. And 2x is nothing to sneeze at!
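
To make the arithmetic behind the 97.5% figure explicit, here is a rough back-of-envelope sketch. It is not a benchmark. It assumes the whole slowdown comes from per-iteration overhead (scheduling plus I/O), 4-byte floats (which is what 400 bytes for 100 features implies), and an order-of-magnitude n^3 operation count for a dense inversion. The function names are made up for illustration.

```python
def overhead_fraction(slowdown: float) -> float:
    """Share of run-time that must be overhead if removing it gives a `slowdown`x speedup.

    Assumes compute time is identical in both systems, so
    total = compute + overhead and total / compute = slowdown.
    """
    return 1.0 - 1.0 / slowdown


def vector_bytes(num_features: int, bytes_per_value: int = 4) -> int:
    """Size of one feature vector, assuming 4-byte floats by default."""
    return num_features * bytes_per_value


def inversion_ops(n: int) -> int:
    """Rough operation count for inverting a dense n x n matrix (order n^3)."""
    return n ** 3


if __name__ == "__main__":
    k = 100  # number of features, as in the example above
    print("bytes per vector:", vector_bytes(k))
    print("approx. ops per inversion:", inversion_ops(k))
    for slowdown in (2, 40):
        share = overhead_fraction(slowdown)
        print(f"{slowdown}x slower implies overhead share of {share:.1%}")
```

Under these assumptions a 40x gap needs overhead to be 39/40 of the run-time, while a 2x gap only needs it to be half, which is why 2x looks like the more plausible guess.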
