+1 and agree with ssc's suggestion.
Sent from my iPhone

> On Apr 7, 2014, at 3:30 AM, Sebastian Schelter <s...@apache.org> wrote:
>
> I agree that the state of the MR code is something that needs to be
> addressed. There have been several attempts to rework/refactor it, but none
> of them had a satisfactory result unfortunately.
>
> I'm hearing that there is a lack of a coherent vision for the future of
> Mahout. Let me suggest a radical one.
>
> - call the next release 0.10 not 1.0, as the latter implies a maturity which
> does not reflect the radical changes I'm proposing
>
> - move all the MR code to a new Maven module, deprecate it and announce that
> we delete it in the release after 0.11
>
> - make the new DSL the heart of Mahout, aim for the following algorithms to
> be implemented in the DSL as a new basis:
>
> Collaborative Filtering:
>
> * Cooccurrence-based recommender (work started in MAHOUT-1464)
> * ALS (work started in MAHOUT-1365)
>
> Clustering:
>
> * k-Means
> * Streaming k-Means
>
> Classification:
>
> * NaiveBayes (work started in MAHOUT-1493)
> * either Random Forests or an ensemble of SGD classifiers
>
> Dimensionality Reduction / Topic Models
>
> * SSVD (prototype in trunk)
> * PCA (prototype in trunk)
> * LDA
>
>
> - integrate Stratosphere / H2O as follows:
>
> * the Stratosphere guys can choose to implement the physical operators of the
> DSL to make our algos run on Stratosphere. If they do, this is great for
> Mahout as it allows people to run code on different backends. If they don't,
> we don't lose anything.
>
> * a major point in porting the algorithms to the DSL would be to make the
> input formats of all algorithms consistent. That would allow H2O to work off
> the same inputs as the Scala DSL.
>
> Let me know what you think.
>
> -s
>
>> On 04/06/2014 05:54 PM, Sean Owen wrote:
>> On Sun, Apr 6, 2014 at 4:16 PM, Andrew Musselman
>> <andrew.mussel...@gmail.com> wrote:
>>> Seems to me there has been a renewed effort to eat our broccoli, along with
>>> the other ideas people have been bringing on board.
>>>
>>> What are you proposing to put in the board report?
>>
>> I have not seen significant activity to unify or update the existing
>> code. It's still the same different chunks with different styles,
>> input/output, distributed/not, etc. The doc updates look very
>> positive. To be fair the task of really addressing the technical debt
>> is very large, so even making said dent would be a lot of work. A
>> clean-slate reboot therefore actually seems like a good plan, but
>> that's another question...
>>
>> Concretely, in a board report, I personally would not agree with
>> representing the Spark or H2O work as an agreed future plan or
>> roadmap, right now. Being in the board report makes that impression,
>> as have recent articles/tweets I've seen, so it deserves care. That's
>> why I chimed in, maybe tilting at windmills.
>>
>> From where I sit with customers, the overall impression is negative
>> among those that have tried to use the code, and usage has gone from
>> few to almost none. I doubt my sample is so different from the whole
>> user population. Much of it is consistency/quality, but some of it's
>> just an interest in non-M/R frameworks.
>>
>> So, I think that current state and set of problems is far more
>> important to acknowledge in a board report than just mentioning some
>> future possibilities, and the latter was the impression I got of the
>> likely content. In fact, it makes the talk about large upcoming
>> possible changes make so much more sense.