Based on some discussion on the private group about where Mahout is faltering in the real world, a stream of thought bubbled up: make Mahout leaner, i.e. push the best stuff we have to the top and prune out algorithms that are underperforming. The main issue here is that the iterative nature of many of the algorithms makes them inefficient to implement on top of current Hadoop. The summary of the state of the discussion so far:

1) Focus on large-scale data (not medium-scale) and focus on algorithms that run at *almost* O(n).
2) Focus on deployability and less on making it an analysis tool for data competitions.
3) Prune, prune, prune things that are not being maintained.

The following is one way of looking at Mahout and the state of its algorithms. Let us know if you would like something to be in the keeper category.

Keepers:
1. Recommenders -- clearly a keeper
2. SGD
3. LDA
4. Some clustering (with upgrades)
5. Math + collections
6. Hadoop Utilities + Integration -- I know it's silly, but things like the sequence file dumper, the iterators, etc. are handy in a number of places.
7. SVD and related
8. RowSimilarity
9. Some of the upfront preprocessing tools (Lucene, Text, etc.)

Unsure:
- Bayes + Random Forest - Seems a shame on Bayes, since it gives a baseline, but I don't know that it actually works, and then there's the whole split-personality nature of it (text-based and vector-based).
- Collocations - I'd say keep for now, even if just for selfish reasons.
- Minhash - Every time I look at it, it seems broken, and the original author doesn't respond to requests for explanation.
- Freq. Item Set - Tom's done some work to clean it up, and I've tried it on search logs and the results looked OK, but there has been no formal evaluation. I've seen others ask why not just do simpler co-occurrence stuff...

Drop for sure:
1. Watchmaker
2. Unused/poor examples
3. Probably a lot more that escapes me at the moment.
4. PageRank

------
Robin Anil
