Hi Dr Dunning,

I'm reluctant to admit that my feeling is similar to that of many of Sean's customers. As a user of both Mahout and Lucene/Solr, I see a lot of similarities between the two:
lucene | mahout
indexing takes text as sparse vectors and builds an inverted index | training takes data as sparse vectors and builds a model
the inverted index lives in memory/HDFS | the model lives in memory/HDFS
use it by feeding in text and getting matches back with scores | use it by feeding in test data and getting scores/labels back
do model selection by comparing the rank order of scores with ground truth | do model selection by comparing scores/labels with ground truth
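
To make the comparison concrete, here is roughly what I mean by the mahout column. This is only a sketch from memory against the 0.x SGD classifier (OnlineLogisticRegression), so class names and details may not match your version exactly:

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class SparseTrainScore {
  public static void main(String[] args) {
    int numFeatures = 1000;

    // "training takes data as sparse vectors and builds a model"
    OnlineLogisticRegression model =
        new OnlineLogisticRegression(2, numFeatures, new L1());

    Vector positive = new RandomAccessSparseVector(numFeatures);
    positive.set(3, 1.0);
    positive.set(42, 1.0);

    Vector negative = new RandomAccessSparseVector(numFeatures);
    negative.set(7, 1.0);

    for (int i = 0; i < 100; i++) {
      model.train(1, positive);   // ground truth label 1
      model.train(0, negative);   // ground truth label 0
    }

    // "use by input test data and return scores/labels"
    double score = model.classifyScalar(positive);

    // "do model selection by comparing scores/labels with ground truth"
    int predicted = score > 0.5 ? 1 : 0;
    System.out.println("score=" + score + " predicted=" + predicted + " truth=1");
  }
}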

Since then, Lucene/Solr/Elasticsearch have evolved into some of the most successful flagship products (as buggy and incomplete as they are, they still gained a breadth of adoption that Mahout never achieved). Yet Mahout still looks like it was assembled with glue and duct tape. The major difficulties I have encountered are:

1. Components are not interchangeable: e.g. the data and model representation for single-node CF is vastly different from the MapReduce CF. New features sometimes introduce backward-incompatible representations. This demoralizes users who integrate with Mahout and expect it to keep improving.
2. Components have strong dependencies on each other: e.g. cross-validation of CF can only use the in-memory DataModel, which SlopeOneRecommender could not update properly (it has since been removed, but you get my point; see the sketch right after this list). Such design issues never drew enough attention beyond a "won't fix" resolution.
3. Many models can only be used internally; they cannot be exported or reused in other applications. This is true of Solr as well, but its RESTful API is universal enough that many ETL tools have been built on top of it. In contrast, Mahout has a very steep learning curve for non-Java developers.
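
Concretely, point 2 looks like this in code. This is a sketch from memory against the 0.x Taste API (the file name is just a placeholder); the evaluator only ever works against an in-memory DataModel, with no way to reuse the representation the MapReduce CF jobs produce:

import java.io.File;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteEvalSketch {
  public static void main(String[] args) throws Exception {
    // Cross-validation is tied to an in-memory DataModel.
    DataModel data = new FileDataModel(new File("ratings.csv"));

    RecommenderBuilder builder = new RecommenderBuilder() {
      @Override
      public Recommender buildRecommender(DataModel model) throws TasteException {
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(25, similarity, model);
        return new GenericUserBasedRecommender(model, neighborhood, similarity);
      }
    };

    // Hold out 30% of the data and report mean absolute error.
    RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
    double mae = evaluator.evaluate(builder, null, data, 0.7, 1.0);
    System.out.println("MAE = " + mae);
  }
}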

It's not a bad idea to see Mahout as a service on top of a library, if it doesn't take too much effort.

Yours,
Peng

On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:
Ravi,

Good points.

On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <ravi.mummu...@gmail.com>wrote:

- Natively support Windows (guidance, etc. No documentation exists today,
for instance)


There is a bit of demand for that.

- Faster time to first application (from discovery to first application
currently takes a non-trivial amount of effort; how can we lower the bar
and reduce the friction for adoption?)


There is huge evidence that this is important.


- Better documenting use cases with working samples/examples (documentation
on https://mahout.apache.org/users/basics/algorithms.html is spread out and
there is too much focus on algorithms as opposed to use cases - this is an
adoption blocker)


This is also important.


- Uniformity of the API set across all algorithms (are we providing the
same experience across all APIs?)


And many people have been tripped up by this.


- Measuring/publishing scalability metrics of various algorithms (why would
we want users to adopt Mahout vs. other frameworks for ML at scale?)


I don't see this as being as important as some of your other points, but it is
still useful.
