On 12.03.2013 23:21, Sean Owen wrote:
> I'm trying to separate what a particular implementation happens to do on
> framework X from what is possible, and with what difficulty. I'm not
> convinced that this example, maybe others, shows M/R is so bad. The
> existence of a slow M/R job doesn't prove there isn't a good one.
>
> For example, the paper says:
>
> We can observe that the Map function of a Hadoop ALS implementation,
> performs no computation and its only purpose is to emit copies of the
> vertex data for every edge in the graph; unnecessarily multiplying
> the amount of data that need to be tracked.
> For example, a user vertex that connects to 100 movies must
> emit the data on the user vertex 100 times, once for each movie
>
> Yes, that's a bad way to do it. I don't even think that's how the Mahout
> version does it? I thought it side-loaded one of the matrices rather than
> join like this. That's "cheating" and takes work, but it's much faster than
> the above. Side-loading stuff starts to make M/R workers into longer-lived
> entities holding a lot of data in memory, which is more like how other
> frameworks behave. Indeed it helps a lot.
It's not a bad way per se, it's a repartition join. Side-loading means
doing a broadcast join, which is more performant but only works as long as
the broadcasted side fits into the receiver's memory. You're right, we
removed the repartition variant.

> I've put a lot of work into an ALS implementation on M/R. It ought to be
> not far off optimal. So I tried benchmarking on a similar setup -- Netflix
> data, 4 8-core machines, 20 features, etc.

Interesting, can you share a few more details on what you did to achieve
that performance?

> I got 178 seconds for 1 iteration, which is about as fast as the reported
> GraphLab time and nothing like the reported 9000 seconds for Hadoop-based.
> Maybe this is measuring something more than just an iteration. If I include
> all the initialization time, including reading a preprocessing and such, I
> get about 800 seconds.
>
> You can argue that the naive approach on M/R is the fair one to benchmark
> and that's what happened here maybe? But this hardly seems to rule out M/R,
> if a smarter approach and -- granted, a lot of honing -- achieves similar
> speed. You can also say that, surely, the same level of optimization work
> would probably speed up another approach even further too, and that's
> probably true.

I think the main argument of the authors is that their programming and
execution model is a much better fit for the problem.

> I'm still skeptical that it's unsuitable for many things, even if you can
> surely imagine a better ideal framework for any given problem.
>
> On Mon, Mar 11, 2013 at 8:45 PM, Sebastian Schelter <[email protected]> wrote:
>
>> The GraphLab guys benchmark their ALS implementation against an old
>> version of ours and in detail describe why they can achieve a 40x to 60x
>> performance improvement. Most of the overhead is attributed to Hadoop
>> and its programming model.
>>
>> Its on the left column of Page 724 in
>> http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf
>>
>> On 11.03.2013 21:35, Dmitriy Lyubimov wrote:
>>> Exactly! I have always thought that was the main reason why ALS in Giraph
>>> was faster.
>>>
>>> Doesn't it make strong case for a hybrid environment? Anyway, what i am
>>> saying, isn't it more or less truthful to say that in pragmatic ways ALS
>>> stuff in Mahout is lagging for the very reason of Mahout being
>>> constrained to MR?
>>
>
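For anyone following along, here is a minimal sketch (plain Python, made-up data, not Mahout or GraphLab code) of the repartition-vs-broadcast-join distinction discussed above, for one half-step of ALS on a bipartite ratings graph. The function and variable names are hypothetical; the point is only that the repartition variant shuffles one copy of a user's factor vector per rating, while the side-loaded variant ships only the edges and looks factors up locally:

```python
from collections import defaultdict

ratings = [                      # (user, item, rating) edges
    ("u1", "m1", 5.0), ("u1", "m2", 3.0),
    ("u2", "m1", 4.0), ("u2", "m3", 2.0),
]
user_factors = {"u1": [0.1, 0.2], "u2": [0.3, 0.4]}  # current user-side factors

def repartition_join(ratings, user_factors):
    """Naive M/R style: the 'map' emits a copy of the user's factor vector
    once per edge, keyed by item, so a user with 100 ratings is duplicated
    100 times across the shuffle."""
    grouped = defaultdict(list)
    for user, item, rating in ratings:               # map phase
        grouped[item].append((user_factors[user], rating))
    return dict(grouped)                             # input to the 'reduce'

def broadcast_join(ratings, user_factors):
    """Side-loading: every worker holds user_factors in memory, so only the
    edges move through the shuffle; this works only while the broadcasted
    side fits into each receiver's memory."""
    grouped = defaultdict(list)
    for user, item, rating in ratings:
        grouped[item].append((user, rating))         # no factor copies shuffled
    # each 'reducer' resolves factors from its local copy:
    return {item: [(user_factors[u], r) for u, r in pairs]
            for item, pairs in grouped.items()}

# Both strategies hand the per-item solver the same grouped input;
# they differ only in how much data crosses the network.
assert repartition_join(ratings, user_factors) == broadcast_join(ratings, user_factors)
```
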
