I'm trying to separate what a particular implementation happens to do on framework X from what is possible, and with what difficulty. I'm not convinced that this example, or others like it, shows that M/R is so bad. The existence of a slow M/R job doesn't prove there isn't a good one.
For example, the paper says: "We can observe that the Map function of a Hadoop ALS implementation performs no computation and its only purpose is to emit copies of the vertex data for every edge in the graph; unnecessarily multiplying the amount of data that need to be tracked. For example, a user vertex that connects to 100 movies must emit the data on the user vertex 100 times, once for each movie."

Yes, that's a bad way to do it, but I don't think that's even how the Mahout version does it. I thought it side-loaded one of the matrices rather than joining like this. That's "cheating" and takes work, but it's much faster than the above. Side-loading starts to make M/R workers into longer-lived entities holding a lot of data in memory, which is more like how other frameworks behave. Indeed, it helps a lot.

I've put a lot of work into an ALS implementation on M/R; it ought to be not far off optimal. So I tried benchmarking on a similar setup: Netflix data, 4 8-core machines, 20 features, etc. I got 178 seconds for one iteration, which is about as fast as the reported GraphLab time and nothing like the reported 9000 seconds for the Hadoop-based version. Maybe that figure measures something more than just an iteration; if I include all the initialization time (reading, preprocessing and such), I get about 800 seconds.

You can argue that the naive approach on M/R is the fair one to benchmark, and maybe that's what happened here. But this hardly seems to rule out M/R if a smarter approach, granted with a lot of honing, achieves similar speed. You can also say that the same level of optimization work would probably speed up another approach even further, and that's probably true. I'm still skeptical that M/R is unsuitable for many of these things, even if you can surely imagine a better framework for any given problem.
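To make the side-loading point concrete, here's a minimal sketch (my own illustration, not Mahout's actual code) of one ALS half-iteration where the full item-factor matrix Y is held in each worker's memory. The map over users then does the actual least-squares solve and emits one factor vector per user, instead of emitting a copy of the user data per edge. The function names and the regularization parameter `lam` are mine, purely for illustration:

```python
import numpy as np

def solve_user(ratings, Y, lam=0.1):
    """Solve one user's factor vector x from its observed (item, rating)
    pairs, given the item-factor matrix Y side-loaded into memory.
    Normal equations: (Yu^T Yu + lam * I) x = Yu^T r."""
    items = [i for i, _ in ratings]
    r = np.array([v for _, v in ratings])
    Yu = Y[items]                       # only the rows for this user's items
    k = Y.shape[1]
    A = Yu.T @ Yu + lam * np.eye(k)
    b = Yu.T @ r
    return np.linalg.solve(A, b)

def als_user_step(user_ratings, Y, lam=0.1):
    # The "map" over users: because every mapper already holds Y,
    # nothing is shuffled per edge -- one small vector out per user.
    return {u: solve_user(rs, Y, lam) for u, rs in user_ratings.items()}
```

The per-user solve is O(nnz_u * k^2 + k^3), and the only data movement is the broadcast of Y once per iteration, which is why a long-lived worker holding Y in memory beats re-emitting vertex data along every edge.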
On Mon, Mar 11, 2013 at 8:45 PM, Sebastian Schelter <[email protected]> wrote:
> The GraphLab guys benchmark their ALS implementation against an old
> version of ours and in detail describe why they can achieve a 40x to 60x
> performance improvement. Most of the overhead is attributed to Hadoop
> and its programming model.
>
> Its on the left column of Page 724 in
> http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf
>
> On 11.03.2013 21:35, Dmitriy Lyubimov wrote:
> > Exactly! I have always thought that was the main reason why ALS in Giraph
> > was faster.
> >
> > Doesn't it make strong case for a hybrid environment? Anyway, what i am
> > saying, isn't it more or less truthful to say that in pragmatic ways ALS
> > stuff in Mahout is lagging for the very reason of Mahout being constrained
> > to MR?
