Re: Discussion Of ML environment/MR, Mahout

Sebastian Schelter Sat, 16 Mar 2013 00:43:42 -0700

Sean,

you were right on this one. I haven't done a thorough benchmark and
comparison against GraphLab yet, but I reworked Mahout's ALS code (soon
to be committed) to use multithreaded mappers and got very nice results.


Without a lot of optimizations, I was able to recompute a feature matrix
for the Netflix dataset in ~40 seconds using 23 machines (the dataset
consists of 23 64MB blocks). This suggests that ALS is indeed fast enough
on Hadoop.



On 14.03.2013 17:02, Sean Owen wrote:
> On Wed, Mar 13, 2013 at 7:41 PM, Sebastian Schelter <[email protected]> wrote:
> 
>> Hadoop has to reschedule every iteration as a separate job, reread the
>> input data from disk and write the iteration's result to HDFS. In fact, an
>> ALS iteration does each of these things twice, as it needs two M/R
>> jobs. GraphLab/Giraph/Stratosphere, on the other hand, have to do none
>> of these three things (GraphLab doesn't even do synchronous iterations),
>> and I highly doubt that a Hadoop implementation can reach comparable
>> performance.
>>
> 
> That's all true but would you imagine I/O is 97.5% of the run-time? A
> 100-feature vector is 400 bytes, but to compute an update you need to
> invert a 100x100 matrix. I can't see the former taking 40x longer than the
> latter. That's why I bet you'll find the current implementation is nothing
> like 40x slower.
> 
> 2x? maybe. And 2x is nothing to sneeze at!
>
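
The back-of-envelope numbers in the quoted argument can be checked with a
short sketch. This is only an illustration, not a benchmark: the 4-byte
float size is an assumption (it is what yields the quoted 400 bytes), the
cubic cost for the 100x100 solve ignores constant factors, and the function
names are hypothetical.

```python
# Back-of-envelope check of the quoted argument (illustrative assumptions only).

FEATURES = 100        # number of features, taken from the thread
BYTES_PER_FLOAT = 4   # assumption: 32-bit floats, giving the quoted 400 bytes


def vector_bytes(features=FEATURES, bytes_per_value=BYTES_PER_FLOAT):
    """Size of one feature vector in bytes."""
    return features * bytes_per_value


def overhead_fraction(slowdown):
    """Share of run time that must be overhead (I/O, scheduling) for a system
    to be `slowdown` times slower, assuming the useful computation takes the
    same time in both systems."""
    return 1.0 - 1.0 / slowdown


def solve_cost_flops(features=FEATURES):
    """Rough cubic cost of inverting a dense features x features matrix.
    Assumption: constant factors are ignored."""
    return features ** 3


if __name__ == "__main__":
    print("bytes per feature vector:", vector_bytes())
    print("overhead share at 40x slowdown:", overhead_fraction(40))
    print("overhead share at 2x slowdown:", overhead_fraction(2))
    print("approx. operations per 100x100 inversion:", solve_cost_flops())
```

Under these assumptions, a 40x slowdown corresponds to the 97.5% overhead
share mentioned above, while a 2x slowdown corresponds to 50%.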
