On 12.03.2013 23:21, Sean Owen wrote:
> I'm trying to separate what a particular implementation happens to do on
> framework X from what is possible, and with what difficulty. I'm not
> convinced that this example, maybe others, shows M/R is so bad. The
> existence of a slow M/R job doesn't prove there isn't a good one.
>
> For example, the paper says:
>
> We can observe that the Map function of a Hadoop ALS implementation,
> performs no computation and its only purpose is to emit copies of the
> vertex data for every edge in the graph; unnecessarily multiplying
> the amount of data that need to be tracked.
> For example, a user vertex that connects to 100 movies must
> emit the data on the user vertex 100 times, once for each movie
>
> Yes, that's a bad way to do it. I don't even think that's how the Mahout
> version does it? I thought it side-loaded one of the matrices rather than
> join like this. That's "cheating" and takes work, but it's much faster than
> the above. Side-loading stuff starts to make M/R workers into longer-lived
> entities holding a lot of data in memory, which is more like how other
> frameworks behave. Indeed it helps a lot.
It's not a bad way per se, it's a repartition join. Side-loading means
doing a broadcast join, which is more performant but only works as long as
the broadcasted side fits into the receiver's memory. You're right, we
removed the repartition variant.

> I've put a lot of work into an ALS implementation on M/R. It ought to be
> not far off optimal. So I tried benchmarking on a similar setup -- Netflix
> data, 4 8-core machines, 20 features, etc.

Interesting, can you share a few more details on what you did to achieve
that performance?

> I got 178 seconds for 1 iteration, which is about as fast as the reported
> GraphLab time and nothing like the reported 9000 seconds for Hadoop-based.
> Maybe this is measuring something more than just an iteration. If I include
> all the initialization time, including reading a preprocessing and such, I
> get about 800 seconds.
>
> You can argue that the naive approach on M/R is the fair one to benchmark
> and that's what happened here maybe? But this hardly seems to rule out M/R,
> if a smarter approach and -- granted, a lot of honing -- achieves similar
> speed. You can also say that, surely, the same level of optimization work
> would probably speed up another approach even further too, and that's
> probably true.

I think the main argument of the authors is that their programming and
execution model is a much better fit for the problem.

> I'm still skeptical that it's unsuitable for many things, even if you can
> surely imagine a better ideal framework for any given problem.
>
> On Mon, Mar 11, 2013 at 8:45 PM, Sebastian Schelter <[email protected]> wrote:
>
>> The GraphLab guys benchmark their ALS implementation against an old
>> version of ours and in detail describe why they can achieve a 40x to 60x
>> performance improvement. Most of the overhead is attributed to Hadoop
>> and its programming model.
>>
>> Its on the left column of Page 724 in
>> http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf
>>
>> On 11.03.2013 21:35, Dmitriy Lyubimov wrote:
>>> Exactly! I have always thought that was the main reason why ALS in Giraph
>>> was faster.
>>>
>>> Doesn't it make strong case for a hybrid environment? Anyway, what i am
>>> saying, isn't it more or less truthful to say that in pragmatic ways ALS
>>> stuff in Mahout is lagging for the very reason of Mahout being
>>> constrained to MR?
>>
>
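For anyone following along, here is a minimal sketch (plain Python, made-up data, not Mahout or GraphLab code) of the repartition-vs-broadcast-join distinction discussed above, for one half-step of ALS on a bipartite ratings graph. The function and variable names are hypothetical; the point is only that the repartition variant shuffles one copy of a user's factor vector per rating, while the side-loaded variant ships only the edges and looks factors up locally:

```python
from collections import defaultdict

ratings = [                      # (user, item, rating) edges
    ("u1", "m1", 5.0), ("u1", "m2", 3.0),
    ("u2", "m1", 4.0), ("u2", "m3", 2.0),
]
user_factors = {"u1": [0.1, 0.2], "u2": [0.3, 0.4]}  # current user-side factors

def repartition_join(ratings, user_factors):
    """Naive M/R style: the 'map' emits a copy of the user's factor vector
    once per edge, keyed by item, so a user with 100 ratings is duplicated
    100 times across the shuffle."""
    grouped = defaultdict(list)
    for user, item, rating in ratings:               # map phase
        grouped[item].append((user_factors[user], rating))
    return dict(grouped)                             # input to the 'reduce'

def broadcast_join(ratings, user_factors):
    """Side-loading: every worker holds user_factors in memory, so only the
    edges move through the shuffle; this works only while the broadcasted
    side fits into each receiver's memory."""
    grouped = defaultdict(list)
    for user, item, rating in ratings:
        grouped[item].append((user, rating))         # no factor copies shuffled
    # each 'reducer' resolves factors from its local copy:
    return {item: [(user_factors[u], r) for u, r in pairs]
            for item, pairs in grouped.items()}

# Both strategies hand the per-item solver the same grouped input;
# they differ only in how much data crosses the network.
assert repartition_join(ratings, user_factors) == broadcast_join(ratings, user_factors)
```
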
