It is true that MLlib may be less production-tested than Mahout, but I would say Spark itself is heavily production-tested and is getting close to a true 1.0 release. Why do you favour Hadoop for "sturdiness"? Spark reads its input from HDFS (or any Hadoop InputFormat), so it benefits from the same fault tolerance with respect to input sources. Spark's fault tolerance model for tasks and jobs is, if anything, superior to that of Hadoop M/R.
For a Downpour SGD-like implementation on Spark see: https://github.com/apache/incubator-spark/pull/407. Assuming the framework for Spark SGD / gradients etc is flexible enough, one should be able to implement neural net / perceptron on top of this. Would be interested to hear if it can be done easily with the current code framework.

On Wed, Feb 19, 2014 at 11:55 PM, peng <[email protected]> wrote:

> I was suggested to switch to MLlib for its performance, but I doubt if
> that is production ready, even if it is I would still favour hadoop's
> sturdiness and self-healing.
> But maybe mahout can include contribs that M/R is not fit for, like
> downpour SGD or graph-based algorithms?
>
> On Wed 19 Feb 2014 07:52:22 AM EST, Sean Owen wrote:
>
>> To set expectations appropriately, I think it's important to point out
>> this is completely infeasible short of a total rewrite, and I can't
>> imagine that will happen. It may not be obvious if you haven't looked
>> at the code how completely dependent on M/R it is.
>>
>> You can swap out M/R and Spark if you write in terms of something like
>> Crunch, but that is not at all the case here.
>>
>> On Wed, Feb 19, 2014 at 12:43 PM, Jay Vyas <[email protected]> wrote:
>>
>>> +100 for this, different execution engines, like the direction pig
>>> and crunch take
>>>
>>> Sent from my iPhone
>>>
>>> On Feb 19, 2014, at 5:19 AM, Gokhan Capan <[email protected]> wrote:
>>>
>>>> I imagine in Mahout offering an option to the users to select from
>>>> different execution engines (just like we currently do by giving
>>>> M/R or sequential options), and starting from Spark. I am not sure
>>>> what changes needed in the codebase, though. Maybe following MLI
>>>> (or alike) and implementing some more stuff, such as common
>>>> interfaces for iterating over data (the M/R way and the Spark way).
>>>>
>>>> IMO, another effort might be porting pre-online machine learning
>>>> (such transforming text into vector based on the dictionary
>>>> generated by seq2sparse before), machine learning based on
>>>> mini-batches, and streaming summarization stuff in Mahout to
>>>> Spark-Streaming.
>>>>
>>>> Best,
>>>> Gokhan
>>>>
>>>> On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>
>>>>> PS I am moving along cost optimizer for spark-backed DRMs on some
>>>>> multiplicative pipelines that is capable of figuring different
>>>>> cost-based rewrites and R-Like DSL that mixes in-core and
>>>>> distributed matrix representations and blocks but it is painfully
>>>>> slow, i really only doing it like couple nights in a month. It
>>>>> does not look like i will be doing it on company time any time
>>>>> soon (and even if i did, the company doesn't seem to be inclined
>>>>> to contribute anything I do anything new on their time). It is all
>>>>> painfully slow, there's no direct funding for it anywhere with no
>>>>> string attached. That probably will be primary reason why Mahout
>>>>> would not be able to get much traction compared to
>>>>> university-based contributions.
>>>>>
>>>>> On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>
>>>>>> Unfortunately methinks the prospects of something like
>>>>>> Mahout/MLLib merge seem very unlikely due to vastly diverged
>>>>>> approach to the basics of linear algebra (and other things). Just
>>>>>> like one cannot grow single tree out of two trunks -- not easily,
>>>>>> anyway.
>>>>>>
>>>>>> It is fairly easy to port (and subsequently beat) MLib at this
>>>>>> point from collection of algorithms point of view. But IMO goal
>>>>>> should be more MLI-like first, and port second. And be very
>>>>>> careful with concepts. Something that i so far don't see
>>>>>> happening with MLib. MLib seems to be old-style Mahout-like rush
>>>>>> to become a collection of basic algorithms rather than coherent
>>>>>> foundation. Admittedly, i havent looked very closely.
>>>>>>
>>>>>> On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter <[email protected]> wrote:
>>>>>>
>>>>>>> I'm also convinced that Spark is a superior platform for
>>>>>>> executing distributed ML algorithms. We've had a discussion
>>>>>>> about a change from Hadoop to another platform some time ago,
>>>>>>> but at that point in time it was not clear which of the upcoming
>>>>>>> dataflow processing systems (Spark, Hyracks, Stratosphere) would
>>>>>>> establish itself amongst the users. To me it seems pretty
>>>>>>> obvious that Spark made the race.
>>>>>>>
>>>>>>> I concur with Ted, it would be great to have the communities
>>>>>>> work together. I know that at least 4 mahout committers
>>>>>>> (including me) are already following Spark's mailinglist and
>>>>>>> actively participating in the discussions.
>>>>>>>
>>>>>>> What are the ideas how a fruitful cooperation look like?
>>>>>>>
>>>>>>> Best,
>>>>>>> Sebastian
>>>>>>>
>>>>>>> PS:
>>>>>>>
>>>>>>> I ported LLR-based cooccurrence analysis (aka item-based
>>>>>>> recommendation) to Spark some time ago, but I haven't had time
>>>>>>> to test my code on a large dataset yet. I'd be happy to see
>>>>>>> someone help with that.
>>>>>>>
>>>>>>> On 02/19/2014 08:04 AM, Nick Pentreath wrote:
>>>>>>>
>>>>>>>> I know the Spark/Mllib devs can occasionally be quite set in
>>>>>>>> ways of doing certain things, but we'd welcome as many Mahout
>>>>>>>> devs as possible to work together.
>>>>>>>>
>>>>>>>> It may be too late, but perhaps a GSoC project to look at a
>>>>>>>> port of some stuff like co occurrence recommender and
>>>>>>>> streaming k-means?
>>>>>>>>
>>>>>>>> N
>>>>>>>> --
>>>>>>>> Sent from Mailbox for iPhone
>>>>>>>>
>>>>>>>> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> My (admittedly heavily biased) view is Spark is a superior
>>>>>>>>>> platform overall for ML. If the two communities can work
>>>>>>>>>> together to leverage the strengths of Spark, and the large
>>>>>>>>>> amount of good stuff in Mahout (as well as the fantastic
>>>>>>>>>> depth of experience of Mahout devs) I think a lot can be
>>>>>>>>>> achieved!
>>>>>>>>>
>>>>>>>>> It makes a lot of sense that Spark would be better than Hadoop
>>>>>>>>> for ML purposes given that Hadoop was intended to do web-crawl
>>>>>>>>> kinds of things and Spark was intentionally built to support
>>>>>>>>> machine learning.
>>>>>>>>> Given that Spark has been announced by a majority of the
>>>>>>>>> Hadoop-based distribution vendors, it makes sense that maybe
>>>>>>>>> Mahout should jump in.
>>>>>>>>> I really would prefer it if the two communities (MLib/MLI and
>>>>>>>>> Mahout) could work more closely together. There is a lot of
>>>>>>>>> good to be had on both sides.

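PS: To make the "perceptron on top of a generic SGD / gradient framework" idea from the top of this message a bit more concrete, here is a rough single-machine sketch in plain Python. It is not the Spark or MLlib API and not the code in the pull request; the names `perceptron_gradient` and `gradient_descent` and the toy data are made up for illustration. The assumption is only that the framework lets you plug in a per-example gradient function and then averages those gradients over the data on each iteration, which is the part a cluster would distribute.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
# One training example: (features including a trailing bias term, label in {-1, +1}).
Example = Tuple[Sequence[float], int]
# Hypothetical plug-in point: gradient(weights, features, label) -> gradient vector.
Gradient = Callable[[Sequence[float], Sequence[float], int], Vector]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def perceptron_gradient(w: Sequence[float], x: Sequence[float], y: int) -> Vector:
    """Subgradient of the perceptron loss max(0, -y * w.x) for one example.

    A zero margin is treated as a mistake so training can move away from w = 0.
    """
    if y * dot(w, x) <= 0:
        return [-y * xi for xi in x]
    return [0.0] * len(x)


def gradient_descent(data: Sequence[Example], gradient: Gradient, dim: int,
                     step: float = 1.0, iterations: int = 10) -> Vector:
    """Generic loop: average the plugged-in gradient over the data, then update."""
    w = [0.0] * dim
    for _ in range(iterations):
        total = [0.0] * dim
        for x, y in data:  # on a cluster this sum would be a distributed aggregate
            g = gradient(w, x, y)
            total = [t + gi for t, gi in zip(total, g)]
        w = [wi - step * t / len(data) for wi, t in zip(w, total)]
    return w


def predict(w: Sequence[float], x: Sequence[float]) -> int:
    return 1 if dot(w, x) > 0 else -1


# Toy, linearly separable data (hypothetical); the last feature is the bias term.
data: List[Example] = [
    ((2.0, 2.0, 1.0), 1), ((3.0, 1.0, 1.0), 1), ((2.0, 3.0, 1.0), 1),
    ((-1.0, -1.0, 1.0), -1), ((-2.0, 0.0, 1.0), -1), ((0.0, -2.0, 1.0), -1),
]
w = gradient_descent(data, perceptron_gradient, dim=3)
predictions = [predict(w, x) for x, _ in data]
```

The only perceptron-specific piece is `perceptron_gradient`; the loop itself does not know which model it is training. If the Spark SGD code keeps that separation, swapping in a different gradient should be the bulk of the work, though whether that holds for a multi-layer net is exactly the open question.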