Hi Dr. Dunning,
I'm reluctant to admit that my feeling is similar to that of many of
Sean's customers. As a user of Mahout and Lucene/Solr, I see a lot of
similarities between the two cases:
Lucene | Mahout
indexing takes text as sparse vectors and builds an inverted index | training takes data as sparse vectors and builds a model
the inverted index lives in memory/HDFS | the model lives in memory/HDFS
queried with input text, returns matches with scores | queried with input test data, returns scores/labels
model selection by comparing the rank order of scores with ground truth | model selection by comparing scores/labels with ground truth
Then Lucene/Solr/Elasticsearch evolved into some of the most successful
flagship products (as buggy and incomplete as they are, they still
gained wide adoption that Mahout never achieved). Yet Mahout still
looks as if it were assembled with glue and duct tape. The major
difficulties I encountered are:
1. Components are not interchangeable: e.g. the data and model
representations for single-node CF are vastly different from those of
MR CF. New features sometimes introduce backward-incompatible
representations. This drastically demoralizes users who try to
integrate with Mahout while expecting improvements.
2. Components have strong dependencies on each other: e.g.
cross-validation of CF can only use the in-memory DataModel, which
SlopeOneRecommender cannot update properly (it has since been removed,
but you get my point). Such design issues never drew enough attention
beyond a "won't fix" resolution.
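To make the coupling complaint concrete, here is a toy sketch in plain Java. The class and method names (InMemoryDataModel, CoupledEvaluator, ratingFor, etc.) are hypothetical, not Mahout's real API; the point is only that an evaluator whose signature names one concrete model class shuts out every other backend, while an interface-typed parameter lets a distributed or HDFS-backed model plug in:

```java
// Hypothetical names -- an illustration of the coupling problem, not Mahout code.

// The concrete in-memory model class.
final class InMemoryDataModel {
    double ratingFor(long user, long item) { return 4.0; }
}

// What an interface-based design would expose instead.
interface DataModel {
    double ratingFor(long user, long item);
}

final class CoupledEvaluator {
    // Signature forces the concrete class: an HDFS-backed model
    // cannot be cross-validated at all.
    double evaluate(InMemoryDataModel model) {
        return model.ratingFor(1L, 2L);
    }
}

final class DecoupledEvaluator {
    // Accepts any implementation, so other backends plug in freely.
    double evaluate(DataModel model) {
        return model.ratingFor(1L, 2L);
    }
}

public class CouplingDemo {
    public static void main(String[] args) {
        DataModel hdfsBacked = (user, item) -> 3.5; // stand-in for a distributed model
        DecoupledEvaluator eval = new DecoupledEvaluator();
        System.out.println(eval.evaluate(hdfsBacked)); // prints 3.5
        // new CoupledEvaluator().evaluate(hdfsBacked); // would not compile
    }
}
```

With the coupled signature, the only fix for a new backend is copying data into the in-memory model first, which is exactly the kind of workaround a "won't fix" answer leaves users with.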
3. Many models can only be used internally and cannot be exported or
reused in other applications. This is true of Solr as well, but its
RESTful API is very universal, and many ETL tools have been built for
it. In contrast, Mahout has a very steep learning curve for non-Java
developers.
It wouldn't be bad to see Mahout become a service on top of a library,
if it doesn't take too much effort.
Yours, Peng
On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:
Ravi,
Good points.
On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <ravi.mummu...@gmail.com>wrote:
- Natively support Windows (guidance, etc. No documentation exists today,
for instance)
There is a bit of demand for that.
- Faster time to first application (from discovery to first application
currently takes a non-trivial amount of effort; how can we lower the bar
and reduce the friction for adoption?)
There is huge evidence that this is important.
- Better documenting use cases with working samples/examples
(Documentation on https://mahout.apache.org/users/basics/algorithms.html
is spread out and there is too much focus on algorithms as opposed to
use cases - this is an adoption blocker)
This is also important.
- Uniformity of the API set across all algorithms (are we providing the
same experience across all APIs?)
And many people have been tripped up by this.
- Measuring/publishing scalability metrics of various algorithms (why
would we want users to adopt Mahout vs. other frameworks for ML at
scale?)
I don't see this as being as important as some of your other points,
but it is still useful.