Re: Discussion Of ML environment/MR, Mahout

Sebastian Schelter Wed, 13 Mar 2013 12:41:57 -0700

I ran the current ALS code on Netflix some months ago, but did not do
a thorough benchmark. I can do this next week, when I have access to a
cluster again, and give some numbers, so we can compare them to GraphLab
(and to your implementation, if you want). I have done lots of
experiments with graph algorithms lately (mainly PageRank), where novel
systems easily outperform Hadoop by an order of magnitude.

Hadoop has to schedule every iteration as a separate job, reread the
input data from disk, and write the iteration's result to HDFS. In fact,
an ALS iteration incurs these costs twice, as it needs two M/R jobs.
GraphLab/Giraph/Stratosphere, on the other hand, have to do none of
these three things (GraphLab doesn't even do synchronous iterations),
and I highly doubt that a Hadoop implementation can reach performance
on par with them.
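
To make the argument above concrete, here is a minimal cost-model
sketch in Python. It is only an illustration: the Costs fields, the
function names, and the numbers in the usage note are hypothetical
placeholders, not measurements from Hadoop, GraphLab, Giraph, or
Stratosphere. It assumes the in-memory system schedules and loads the
data once and writes only the final result.

```python
from dataclasses import dataclass


@dataclass
class Costs:
    """Hypothetical cost parameters, all in seconds."""
    schedule: float  # scheduling/launching one job
    read: float      # reading the input data from disk
    write: float     # writing a result to HDFS
    compute: float   # actual computation per iteration


def hadoop_runtime(c: Costs, iterations: int, jobs_per_iteration: int = 2) -> float:
    """Every M/R job pays scheduling, input read, and output write.

    jobs_per_iteration defaults to 2 because an ALS iteration needs
    two M/R jobs.
    """
    per_job_overhead = c.schedule + c.read + c.write
    return iterations * (jobs_per_iteration * per_job_overhead + c.compute)


def in_memory_runtime(c: Costs, iterations: int) -> float:
    """Assumed model: schedule once, load once, iterate, write once."""
    return c.schedule + c.read + iterations * c.compute + c.write
```

With made-up values such as Costs(schedule=10, read=30, write=20,
compute=40) and 10 iterations, the model gives 1600 seconds for the
Hadoop variant and 460 seconds for the in-memory variant. How large the
gap is in practice depends entirely on how the overhead compares to the
compute time, which is exactly the point under dispute in this thread.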

I won't say, however, that a highly optimized Hadoop implementation
won't give you performance suitable for real-world use cases. If I
were to develop a commercial product atm, I would clearly bet on Hadoop.

It's just that I recently saw the performance of our research prototype
(doing an iteration of PageRank on a billion-edge dataset in 20 secs on
two dozen machines), and that makes me think that a lot of concepts
inherited from Hadoop (such as its fault tolerance) should be
rethought. I assume that this will happen in the near future (the next
months/years), and I therefore think that Mahout development should
start to look into other systems too (in some experimental way).

/s

On 13.03.2013 11:00, Sean Owen wrote:
> On Wed, Mar 13, 2013 at 2:04 AM, Dmitriy Lyubimov <[email protected]> wrote:
> 
>> Yeah. The stuck point for me is page-rankish-finding stationary
>> distributions and extremely popular ALS based stuff. We've beaten the heck
>> out of it a year ago and Sebastian conclusively stated Giraph ALS knocks
>> the socks off MR version. Add to that a bisect search for a good
>>
> 
> This keeps being said, but, I thought Sebastian just said that the M/R
> version he mentioned being much slower was a different version, deleted
> from this project? See my other email. The current version is similar to
> the one I just benchmarked, and that appeared to be about as fast as
> GraphLab (still not clear if the same amount of work is being compared
> though).
> 
> This matches my hunch that these things are about the same, modulo some
> extra disk I/O, which is not most of the runtime.
> 
> I point it out in case this is underpinning many people's logic for
> rebuilding a bunch of stuff because it will be a *lot* faster. Surely some
> stuff can be done more naturally in a graph paradigm but not everything, or
> most? I'm worried about the conclusion because of cases like this.
>
