Hi,
I think this is an important discussion to have, and it's good that we're
having it. I wish I could say otherwise, but I have encountered many of
the impressions that Sean mentioned. To be honest, I don't see Mahout as
ready to move to 1.0 in its current state.
I still see our main problem as the failure to provide viable
documentation and guidance to users. We cleaned up the wiki, but that is
only a first step. I feel that it is extremely hard for people to use
the majority of our algorithms unless they understand the mathematical
details and are willing to dig through the source code. I think Mahout
contains a lot of "hidden gems" that make it unique (e.g. Cooccurrence
Analysis with RowSimilarityJob, LDA with CVB, SSVD+PCA), but for the
majority of users these gems are out of reach.
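To give a sense of the problem, here is roughly what a user currently
has to reverse-engineer from the source just to run the cooccurrence
analysis (a sketch from memory; the class location and option names are
assumptions and vary between versions):

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob;

    public class CooccurrenceAnalysisExample {
      public static void main(String[] args) throws Exception {
        // RowSimilarityJob extends AbstractJob, so it runs as a Hadoop Tool.
        ToolRunner.run(new RowSimilarityJob(), new String[] {
            // input: matrix rows as SequenceFile<IntWritable,VectorWritable>
            "--input", "/data/matrix",
            "--output", "/data/row-similarities",
            "--numberOfColumns", "100000",
            "--similarityClassname", "SIMILARITY_LOGLIKELIHOOD",
            "--maxSimilaritiesPerRow", "50"
        });
      }
    }

Nothing about this -- the expected input format, the available
similarity measures, what the options mean -- is written down anywhere a
newcomer would look.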
Another important aspect is that machine learning on MapReduce will
vanish very soon, and there is no vision yet for moving Mahout to more
suitable platforms.
I think our lack of documentation causes a lack of users, which stalls
development and, together with the emergence of other platforms like
Spark, makes it hard for us to attract new contributors.
I must say that the architecture of Oryx is really what I would envision
for Mahout: provide a computation layer for training models and a
serving layer with a REST API or Solr for deploying them, and then
abstract the training in the computation layer so that models can be
trained in-memory, with Hadoop, Spark, Stratosphere, you name it. I was
quite emotional during the discussion after Oryx was announced as a
separate project, because I felt that this is what Mahout should have become.
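To make that abstraction concrete, here is a rough sketch of what I mean
(all names are hypothetical; none of this is actual Mahout or Oryx API):

    import java.util.Arrays;
    import java.util.List;

    /** One training contract; the execution platform is a pluggable detail. */
    interface TrainingBackend<I, M> {
      /** Fits a model of type M from input data referenced by I. */
      M train(I input);
    }

    /** In-memory backend for small data, training a trivial "mean" model. */
    class InMemoryMeanTrainer implements TrainingBackend<List<Double>, Double> {
      @Override
      public Double train(List<Double> values) {
        double sum = 0;
        for (double v : values) {
          sum += v;
        }
        return values.isEmpty() ? 0.0 : sum / values.size();
      }
    }

    // A Hadoop, Spark or Stratosphere backend would implement the same
    // interface, taking e.g. an HDFS path instead of an in-memory list.

    public class ComputationLayerSketch {
      public static void main(String[] args) {
        TrainingBackend<List<Double>, Double> backend = new InMemoryMeanTrainer();
        double model = backend.train(Arrays.asList(1.0, 2.0, 3.0));
        // The serving layer would then expose the trained model via REST or Solr.
        System.out.println("trained model (mean) = " + model);
      }
    }

The algorithm code would be written once against such a contract, and
the platform would become a deployment decision instead of a rewrite.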
Just my 2 cents,
Sebastian
On 02/28/2014 10:56 AM, Sean Owen wrote:
OK, your defeatism is my realism. Why has Negative Nancy intruded on
this conversation?
I have a view into many large Hadoop users. The feedback from the
minority that have tried Mahout is that it is inconsistent/unfinished
("a confederation of unrelated grad-school projects" as one put it),
buggy, and hard to use except as a few copied snippets of code. Ouch!
Only a handful that I'm aware of actually use it. Internally, there is
a perception that there is no community attention to most of the code
(see the JIRA backlog). As a result -- software problems, community
issues, little demand -- it is almost certainly not going to be in our
next major packaging release, and was almost left out of the forthcoming
one.
Your Reality May Vary. This seems like yellow-flag territory for an
Apache project though, if this is representative of a wider reality.
So a conversation about whole other projects' worth of new
functionality feels quite disconnected -- red-flag territory.
To be constructive, here are four items that seem more important for
something like "1.0.0" and would even be a lot less work:
- Use the Hadoop .mapreduce API consistently (see the sketch after this list)
- Standardize the input and output formats of all jobs
- Remove use of deprecated code
- Clear even a third of the open JIRA backlog
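For the first item, what "consistently" means, illustratively: every job
should be written against the current org.apache.hadoop.mapreduce API,
like the minimal mapper below, rather than the deprecated
org.apache.hadoop.mapred one (a sketch, not code from the Mahout tree):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    /** New-API mapper: extends mapreduce.Mapper instead of implementing
        the deprecated mapred.Mapper interface. */
    public class TokenCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
          word.set(token);
          context.write(word, ONE);  // emit via Context, not OutputCollector
        }
      }
    }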
(I still think it's fine to create separate projects for quite
different ideas. Hadoop has another ML project, and is about to have yet
another one. These good ideas might well belong there instead. Here, I
think there is a big need for shoring up if the project is even going to
survive to 1.0.)
On Thu, Feb 27, 2014 at 5:25 PM, Sean Owen <sro...@gmail.com> wrote:
I think each of several of these other points is probably, on its own,
several times the amount of work that has been put into this project
over the past year, so I'm wondering whether this is close to realistic
as a to-do list for 1.0 of this project.
Those are means, though. I think that everything on this list is
possible in relatively short order, but let's talk goals for a bit.
What is missing here? What really doesn't matter?