---------- Forwarded message ----------
From: Dmitriy Lyubimov <[email protected]>
Date: Mon, Mar 11, 2013 at 11:38 AM
Subject: Re: Missing Mahout board report
To: [email protected]
On Mon, Mar 11, 2013 at 11:27 AM, Sean Owen <[email protected]> wrote:
>
> On Mon, Mar 11, 2013 at 6:06 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> > Hadoop MR platform of course, 'cause those are all flaws of the Hadoop MR.
> > So Mahout just suffers the ills of MR and that's why the flagships of
> > distributed CF algorithms frankly do not shine here
> >
>
> FWIW I think there's a big performance difference between an M/R job and
> an optimized one. It takes a lot of honing, tuning, and cheating to make
> them run fast, and that's the practical problem. But I'd hate to
> necessarily conflate what's in this project with what's possible in M/R.
>
> > So it does call for a new distributed environment to use -- other than "MR
> > 1.0" -- if distributed stuff is to be presented in Mahout on par with the
> > competition. I don't know how feasible that is though.
>
> Depends on your goal -- if building a tool for academia or for fun or for a
> purpose-built project, any tool is in bounds, maybe even niche or alpha
> ones. You can pick the tool that is optimal just for the problem being
> solved. Hadoop is the devil people know though. If you're writing a product
> / project for the broad market in 2013 I think it's still Hadoop-based.
> Some of these alternatives look like they will become mature but niche, or
> broadly applicable but not mature. Most of what I'm seeing still feels to
> be of the form "I solved this problem with a specialized framework and it's
> faster than a bad M/R implementation", which is good but not game-changing.
> A generalized M/R (a la YARN) is my personal bet, but probably will be
> worth building around later this year.

Sure. For many pragmatic projects Apache's MR will be just good enough.
Familiarity beats additional hardware costs; super-large problems are not
that common.

The problem is still a little bit about how to make ALS-like stuff practical.
As far as I can recollect, Sebastian did not recommend that stuff in Mahout
(as opposed to Giraph), for one because it is not practical to run it enough
times to figure out a good regularization parameter automatically.

Many such problems are not just slow-startup/high-I/O type of things; in many
cases it is about the MR shuffle-and-sort logic itself. Imagine for a moment
we wanted to solve the problem of deinterlacing an NTSC signal. We get two
fields, the first containing the odd lines and the second containing the even
lines. The MR way of solving that is to key every line with (field#, line#)
and then do a shuffle-and-sort. The sort component adds a log factor to the
asymptotic complexity, whereas it is clear that a streaming merge algorithm
wouldn't need to sort at all and could capitalize on the structure we already
know. (Sure, you can do it map-side with specific streaming join logic, but
that would not be pure MR, rather some map-task acrobatics.)

A lot of the things we do with blocked matrix arithmetic are exactly like
that: they have structure, but we cannot use that structure and forward it
appropriately at scale unless we run through a sort.
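
To make the ALS tuning cost concrete, here is a rough sketch of what picking
the regularization parameter by hold-out search looks like. The method names
(AlsRunner, trainAndScore) and the candidate values are placeholders, not
Mahout driver APIs; the point is only that every candidate lambda forces a
complete multi-iteration factorization, each iteration a pair of MR jobs in
an MR implementation, which is what makes automatic tuning impractical at
scale.

    /** Sketch of lambda selection by hold-out search; hypothetical API. */
    public final class LambdaSearchSketch {

      interface AlsRunner {
        /** Factorizes the training matrix and returns hold-out RMSE. */
        double trainAndScore(String trainPath, String holdoutPath,
                             double lambda, int rank, int numIterations);
      }

      public static double pickLambda(AlsRunner als, String train, String holdout) {
        double[] candidates = {0.01, 0.03, 0.1, 0.3, 1.0};
        double bestLambda = candidates[0];
        double bestRmse = Double.MAX_VALUE;
        for (double lambda : candidates) {
          // One full factorization per candidate, e.g. 10 iterations,
          // each iteration a user-factor job plus an item-factor job on MR.
          double rmse = als.trainAndScore(train, holdout, lambda, 50, 10);
          if (rmse < bestRmse) {
            bestRmse = rmse;
            bestLambda = lambda;
          }
        }
        return bestLambda;
      }
    }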

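And to make the deinterlacing analogy concrete, here is a minimal streaming
merge sketch. The types and method names are mine, just for illustration. It
interleaves two fields whose lines are already in order in a single O(n)
pass; the MR formulation of the same job would re-establish that ordering by
keying on (field#, line#) and paying for a shuffle-and-sort.

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    /** Minimal sketch: deinterlace two fields whose lines are already ordered.
     *  Because the structure (odd/even line numbers) is known up front, one
     *  streaming pass suffices; no keying, no shuffle-and-sort. */
    public final class StreamingDeinterlace {

      /** Hypothetical scan line: index within the frame plus pixel payload. */
      public static final class Line {
        public final int lineNo;
        public final byte[] pixels;
        public Line(int lineNo, byte[] pixels) {
          this.lineNo = lineNo;
          this.pixels = pixels;
        }
      }

      /** O(n) merge of the odd and even fields into full-frame line order. */
      public static List<Line> merge(Iterator<Line> oddField, Iterator<Line> evenField) {
        List<Line> frame = new ArrayList<Line>();
        while (oddField.hasNext() || evenField.hasNext()) {
          if (oddField.hasNext())  frame.add(oddField.next());   // lines 1, 3, 5, ...
          if (evenField.hasNext()) frame.add(evenField.next());  // lines 2, 4, 6, ...
        }
        return frame;
      }
    }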