Hi Josh,
On Tue, Mar 12, 2013 at 3:05 PM, Josh Wills <[email protected]> wrote:
> Hey Mahout Devs,
>
> First, I wanted to say that I think that there are lots of problems that
> can be handled well in MapReduce (the recent k-means streaming stuff being
> a prime example), even if they could be performed even faster using an
> in-memory model.

Yeah. The sticking point for me is PageRank-ish computation of stationary distributions, and the extremely popular ALS-based stuff. We beat the heck out of this a year ago, and Sebastian conclusively showed that the Giraph ALS knocks the socks off the MR version. Add a bisection search for a good regularization parameter on top of that, and the number of iterations needed multiplies -- as far as I can remember, that is one thing we eventually dropped completely for that very reason. A parallel search is also possible, but it would require far more crunching than the bisection one.

Even with Shark I am able to do ETL-like things in under 5s on a dataset where the same Hive query takes minutes to complete. There is absolutely no question that for some of our exploratory requirements MR just doesn't cut it -- especially for problems with a high number of supersteps, a well-defined blocking structure, and high interconnectedness.

Also: batch work is 95% ETL / variable prep. There is absolutely no argument that batch systems such as Crunch, Cascading, and Pig will stay absolutely valuable in that problem space. If we set those 95% aside, however, and keep the relatively condensed ML problems, then for at least 50% of real-life problems we land on medium-size, CPU-bound datasets with high interconnectedness and a high number of supersteps. Just as Ted mentioned, I find this to be increasingly, pragmatically the case.
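To make the iteration-multiplication point concrete, here is a minimal sketch of the kind of bisection-style search over the regularization parameter I mean (unimodal validation error assumed; all names are hypothetical). In a real ALS setting every single call to the error function is itself a full multi-iteration factorization run, which is exactly why the total iteration count multiplies:

```scala
// Hedged sketch, not Mahout's actual code: a bisection-style
// (ternary) search for the regularization parameter lambda,
// assuming held-out validation error is unimodal in lambda.
object RegularizationSearch {
  // `error` stands in for "run full ALS at this lambda and
  // measure held-out error" -- the expensive part in practice.
  def search(error: Double => Double,
             lo0: Double, hi0: Double,
             tol: Double = 1e-3): Double = {
    var lo = lo0
    var hi = hi0
    while (hi - lo > tol) {
      // Probe at the one-third and two-thirds points and
      // discard the third of the interval that cannot
      // contain the minimum.
      val m1 = lo + (hi - lo) / 3
      val m2 = hi - (hi - lo) / 3
      if (error(m1) < error(m2)) hi = m2 else lo = m1
    }
    (lo + hi) / 2
  }
}
```

With, say, 20 probes to converge and 10 ALS sweeps per probe, an MR implementation pays 200 full job-scheduling round trips, which is where the in-memory frameworks pull away.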
I had to eat a lot of dung with SSVD in order to manage the problem of interconnectedness and fit it into small information pieces so that the side-load is not apparent at scale, but even that approach still has it, and it would undoubtedly run more efficiently once combined with scatter/distributed-memory techniques.

I found some operational problems with Spark (notably, task cleanup in the workers), although it is very likely me somehow being stupid or running a lot of code that breaks. But I surprisingly found myself swearing a lot less than I have about some Hadoop-ecosystem products operationally. I have not had a production trial yet, though, to be honest. We'll see.

> I'm wondering if we could solve both problems by creating a wrapper API
> that would look a lot like the Spark RDD API and then created
> implementations of that API via Spark/Giraph/Crunch (truly shameless
> promotion: http://crunch.apache.org/ ). That way, the same model could be
> run in-memory against Spark or in batch via a series of MapReduce jobs (or
> a BSP job, or a Tez pipeline, or whatever execution framework is written
> next week.)

I have considered similar things before. One problem with this approach, as I see it, is that you usually have to create some sort of format bridging, or at the very least persist the results when changing gears between approaches. One of the nicest things about Spark is that the datasets are already partitioned in memory, not going anywhere, and ready to launch either a superstep or a mapper the next nanosecond without any additional ado. They are giving it all in the same giftwrap. And yes, there is this scatter thing vs. the forced-sorting thing, but oh well. And yes, they are just as operationally compatible with the Hadoop stack as regular MR -- same HDFS location affinity. I just recently finished a parallel RDD pulling data simultaneously from an HBase coprocessor-based spatial scan, with the whole region-server location-affinity thing; it was very easy to integrate.
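For what it's worth, the wrapper idea could be sketched as a minimal RDD-like facade with pluggable backends; all the names below are hypothetical and not any project's actual API, and only a trivial in-memory backend is shown. The hard part, as noted above, is that a Spark or Crunch backend would each hold partitioned state in its own form, so switching backends mid-pipeline still implies format bridging or persisting intermediate results:

```scala
// Hedged sketch of the proposed wrapper: a distributed-collection
// trait that could in principle be implemented over Spark, Crunch,
// or Giraph. Illustrative names only.
trait DCollection[T] {
  def map[U](f: T => U): DCollection[U]
  def filter(p: T => Boolean): DCollection[T]
  def reduce(f: (T, T) => T): T
}

// Trivial in-memory backend standing in for a Spark- or
// Crunch-backed implementation.
class LocalCollection[T](val data: Seq[T]) extends DCollection[T] {
  def map[U](f: T => U): DCollection[U] =
    new LocalCollection(data.map(f))
  def filter(p: T => Boolean): DCollection[T] =
    new LocalCollection(data.filter(p))
  def reduce(f: (T, T) => T): T = data.reduce(f)
}
```

The facade itself is easy; the gear-changing cost shows up when one backend has to hand its partitioned state to another.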
(Well, easy to integrate on the Spark side; the endpoint coprocessor API in pre-0.96 HBase is woefully lacking in streaming, so... but even then...)

> The main virtue of Crunch in this regard is that the data model
> is very similar to Spark's (truth be told, I used the Spark API as a
> reference when I was originally creating Crunch's Scala API, "Scrunch")--
> the whole idea of distributed collections of "things," where things can be
> anything you want them to be (e.g., Mahout Vectors.)

My opinion is that Crunch is no better or worse than the RDD API, but it lacks bulk parallel operations (as it stands today, obviously). And an argument about subset similarity is far from an argument about superiority. FWIW, I believe the Crunch API is far better than Cascading's for integrating with Mahout's vectors (and better suited for custom type integration, such as R types), which is why I was, and still am, interested in that direction; but Spark adoption is at the very least just as easy.

> I don't have an opinion on the structure of such an effort (via the
> incubator or otherwise), but I thought I would throw the idea out there, as
> it's something I would definitely like to be involved in.

I support extending Mahout toward any such end (on the side first, with module-level isolation). In the end, everything that is added either wins by being used or stagnates, but without fresh attempts it is hard to expect any results at all. Even the sheer fact that this discussion exists is a testament that Mahout may need to try something new operationally to maintain and extend its acceptance. But I think the agreement has always been to maintain high entry criteria for new stuff with good usage prospects. Just adding stuff for the sake of adding stuff to the mainstream seemed to have a tendency to dissolve the project's cohesiveness in the past.

-d
