Praneet and I were just talking about a project he is working on that involves higher-order learning methods such as boosting and feature sharding. This is all pretty much in the context of classification and possibly clustering.

The problems are:

a) Mahout doesn't have a general input format for classifiable data (this has been discussed recently).
b) Hashed vector representations are not suitable for feature sharding, since an individual feature may be redundantly represented in many locations.
c) Mahout doesn't have a reasonable data structure for general data transfer (related to (a)).

One possible thought is that Mahout could introduce Weka as a dependency. The virtues would be:

1) Weka has ARFF as a data format and Instance as an object, which would satisfy (a) and (c).
2) Weka provides a bunch of simple classifier algorithms which are not individually scalable, but might be made so by model averaging or feature sharding.
3) Praneet could finish his project very quickly.

Any thoughts about this?

The problems that I see with this include:

A) Weka is GPL, which might slow adoption of Mahout and would certainly inhibit direct incorporation of any piece of Weka.
B) Weka does not appear to have caught the Maven bug, which makes it harder to add as a dependency without actually distributing the Weka jar.

One possible work-around might be to reverse engineer something like Instance and an ARFF reader/writer.
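
To make the work-around a bit more concrete, here is a rough sketch of what a clean-room ARFF reader and an Instance-like row object might look like. Everything in it is hypothetical: the names ArffSketch, Dataset and Row are made up for illustration and are neither Weka's nor Mahout's API. It only handles the dense ARFF form (@relation, @attribute, @data and % comments); quoted names and values, sparse rows and missing-value handling are left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of a minimal dense-ARFF reader; not Weka's or Mahout's API. */
public class ArffSketch {

  /** Stand-in for an Instance-like object: one row of values addressed by attribute index. */
  public static final class Row {
    private final String[] values;

    Row(String[] values) {
      this.values = values;
    }

    public String value(int attributeIndex) {
      return values[attributeIndex];
    }

    /** Parses the value as a double; throws NumberFormatException for nominal values. */
    public double numericValue(int attributeIndex) {
      return Double.parseDouble(values[attributeIndex]);
    }
  }

  /** The relation name, the attribute names in declaration order, and the data rows. */
  public static final class Dataset {
    public String relation;
    public final List<String> attributeNames = new ArrayList<>();
    public final List<Row> rows = new ArrayList<>();
  }

  /** Reads the header up to @data, then treats every remaining line as a comma-separated row. */
  public static Dataset read(Reader in) throws IOException {
    BufferedReader reader = new BufferedReader(in);
    Dataset data = new Dataset();
    boolean inData = false;
    String line;
    while ((line = reader.readLine()) != null) {
      line = line.trim();
      if (line.isEmpty() || line.startsWith("%")) {
        continue; // skip blank lines and ARFF comments
      }
      if (!inData) {
        // ARFF header keywords are case-insensitive
        String lower = line.toLowerCase(Locale.ROOT);
        if (lower.startsWith("@relation")) {
          data.relation = line.substring("@relation".length()).trim();
        } else if (lower.startsWith("@attribute")) {
          // "@attribute <name> <type>"; assumes the name is unquoted and has no spaces
          String[] parts = line.split("\\s+", 3);
          data.attributeNames.add(parts[1]);
        } else if (lower.startsWith("@data")) {
          inData = true;
        }
      } else {
        // Naive split: assumes values contain no quoted commas
        String[] fields = line.split(",", -1);
        if (fields.length != data.attributeNames.size()) {
          throw new IOException("Expected " + data.attributeNames.size()
              + " values but found " + fields.length + ": " + line);
        }
        for (int i = 0; i < fields.length; i++) {
          fields[i] = fields[i].trim();
        }
        data.rows.add(new Row(fields));
      }
    }
    return data;
  }
}
```

Calling ArffSketch.read(new FileReader("weather.arff")) (the file name is just an example) would give back the attribute names and rows. Because each value stays addressed by its attribute index rather than by a hashed location, a layout like this would also keep each feature in exactly one place, which is what feature sharding needs per (b).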
