I agree that the state of the MR code is something that needs to be addressed. There have been several attempts to rework/refactor it, but none of them had a satisfactory result unfortunately.

I'm hearing that there is a lack of a coherent vision for the future of Mahout. Let me suggest a radical one.

- call the next release 0.10 not 1.0, as the latter implies a maturity which does not reflect the radical changes I'm proposing

- move all the MR code to a new Maven module, deprecate it, and announce that we will delete it in the release after 0.11

- make the new DSL the heart of Mahout, aim for the following algorithms to be implemented in the DSL as a new basis:

Collaborative Filtering:

 * Cooccurrence-based recommender (work started in MAHOUT-1464)
 * ALS (work started in MAHOUT-1365)

Clustering:

 * k-Means
 * Streaming k-Means

Classification:

 * NaiveBayes (work started in MAHOUT-1493)
 * either Random Forests or an ensemble of SGD classifiers

Dimensionality Reduction / Topic Models:

 * SSVD (prototype in trunk)
 * PCA (prototype in trunk)
 * LDA


- integrate Stratosphere / H2O as follows:

* the Stratosphere guys can choose to implement the physical operators of the DSL to make our algos run on Stratosphere. If they do, this is great for Mahout as it allows people to run code on different backends. If they don't, we don't lose anything.

* a major point in porting the algorithms to the DSL would be to make the input formats of all algorithms consistent. That would allow H2O to work off the same inputs as the Scala DSL.
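
To make the backend idea a bit more concrete, here is a rough sketch of the separation between logical and physical operators. It is not Mahout code: it is written in Python only for brevity, and all names (`Backend`, `LocalBackend`, `CountingBackend`, `execute`, `gram`) are hypothetical. The point is only that an algorithm written against the logical operators should not need to change when a different engine supplies the physical ones.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


# Logical plan nodes: they describe WHAT to compute, not how.
@dataclass
class Source:
    rows: list  # a matrix as a list of rows, each a list of numbers


@dataclass
class Transpose:
    child: object


@dataclass
class Times:
    left: object
    right: object


class Backend(ABC):
    """Physical operators an engine would have to supply (hypothetical interface)."""

    @abstractmethod
    def transpose(self, m): ...

    @abstractmethod
    def times(self, a, b): ...


class LocalBackend(Backend):
    """In-memory reference backend using plain Python lists."""

    def transpose(self, m):
        return [list(col) for col in zip(*m)]

    def times(self, a, b):
        # zip(*b) iterates over the columns of b
        return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
                for row in a]


class CountingBackend(LocalBackend):
    """Stand-in for a second engine: same results, but records which operators ran."""

    def __init__(self):
        self.calls = []

    def transpose(self, m):
        self.calls.append("transpose")
        return super().transpose(m)

    def times(self, a, b):
        self.calls.append("times")
        return super().times(a, b)


def execute(plan, backend):
    """Walk the logical plan and dispatch each node to the backend's physical operator."""
    if isinstance(plan, Source):
        return plan.rows
    if isinstance(plan, Transpose):
        return backend.transpose(execute(plan.child, backend))
    if isinstance(plan, Times):
        return backend.times(execute(plan.left, backend),
                             execute(plan.right, backend))
    raise TypeError(f"unknown plan node: {plan!r}")


def gram(a):
    """'Algorithm' code: builds the logical plan for A'A without naming a backend."""
    return Times(Transpose(a), a)
```

The same `gram(...)` plan can be handed to `execute` with either backend. Consistent input formats matter for the same reason: every backend would have to agree on what a `Source` looks like.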

Let me know what you think.

-s





On 04/06/2014 05:54 PM, Sean Owen wrote:
On Sun, Apr 6, 2014 at 4:16 PM, Andrew Musselman
<andrew.mussel...@gmail.com> wrote:
Seems to me there has been a renewed effort to eat our broccoli, along with
the other ideas people have been bringing on board.

What are you proposing to put in the board report?

I have not seen significant activity to unify or update the existing
code. It's still the same different chunks with different styles,
input/output, distributed/not, etc. The doc updates look very
positive. To be fair the task of really addressing the technical debt
is very large, so even making said dent would be a lot of work. A
clean-slate reboot therefore actually seems like a good plan, but
that's another question...

Concretely, in a board report, I personally would not agree with
representing the Spark or H2O work as an agreed future plan or
roadmap, right now. Being in the board report makes that impression,
as have recent articles/tweets I've seen, so it deserves care. That's
why I chimed in, maybe tilting at windmills.

From where I sit with customers, the overall impression is negative
among those that have tried to use the code, and usage has gone from
few to almost none. I doubt my sample is so different from the whole
user population. Much of it is consistency/quality, but some of it's
just an interest in non-M/R frameworks.

So, I think that current state and set of problems is far more
important to acknowledge in a board report than just mentioning some
future possibilities, and the latter was the impression I got of the
likely content. In fact, it makes the talk about large upcoming
possible changes make so much more sense.

