My initial inclination is -1 on adding a GPL dependency. Can you spell out exactly what is meant by needing a "general input format" and a "general transfer format"? We currently take in raw text and then vectorize it. Are Vectors (with either hashed encoding or a dictionary file) not suitable as a format for some reason?
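
To make the two encodings in that question concrete, here is a minimal Python sketch. It is not Mahout code (Mahout is Java); the function names, the MD5-based hash, and the two-probe default are all assumptions chosen for illustration. It contrasts a dictionary encoding, where each feature owns exactly one index, with a hashed encoding, where each feature's value is added at several hashed indices that other features may share.

```python
import hashlib


def hashed_slots(feature, size, probes=2):
    """Return the indices a feature name maps to under hashed encoding.

    Hypothetical helper: each probe hashes the name with a different suffix.
    """
    slots = []
    for probe in range(probes):
        digest = hashlib.md5(f"{feature}#{probe}".encode("utf-8")).digest()
        slots.append(int.from_bytes(digest[:4], "big") % size)
    return slots


def encode_hashed(features, size, probes=2):
    """Encode {name: value} into a fixed-size vector; no dictionary needed."""
    vec = [0.0] * size
    for name, value in features.items():
        for slot in hashed_slots(name, size, probes):
            vec[slot] += value  # colliding features simply add together
    return vec


def build_dictionary(records):
    """Assign each distinct feature name one index, in first-seen order."""
    dictionary = {}
    for record in records:
        for name in record:
            dictionary.setdefault(name, len(dictionary))
    return dictionary


def encode_with_dictionary(features, dictionary):
    """Encode {name: value} so that each feature occupies exactly one index."""
    vec = [0.0] * len(dictionary)
    for name, value in features.items():
        vec[dictionary[name]] = value
    return vec


records = [{"age": 31.0, "height": 1.8}, {"age": 45.0, "weight": 70.0}]
dictionary = build_dictionary(records)
dense = encode_with_dictionary(records[0], dictionary)
hashed = encode_hashed(records[0], size=16)
```

With the dictionary, a set of indices still corresponds to a set of features, so splitting by index splits by feature. With hashing, one feature's value is spread over several slots, which is the redundancy raised in point (b) of the quoted message.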

-jake

On Wed, Aug 24, 2011 at 3:09 PM, Ted Dunning <[email protected]> wrote:

> Praneet and I were just talking about a project he is working on to do with
> higher-order learning methods such as boosting and feature sharding. This
> is all pretty much in the context of classification and possibly
> clustering.
>
> The problems are:
>
> a) mahout doesn't have a general input format for classifiable data (this
> has been discussed recently)
>
> b) hashed vector representations are not suitable for feature sharding
> since individual features may be redundantly represented in many locations.
>
> c) mahout doesn't have a reasonable data structure for general data
> transfer (related to -a-)
>
> One possible thought is that Mahout could introduce Weka as a dependency.
>
> The virtues would be:
>
> 1) Weka has ARFF as a data format and Instance as an object to satisfy (a)
> and (c)
>
> 2) Weka provides a bunch of simple classifier algorithms which are not
> individually scalable, but might be made to be so by model averaging or
> feature sharding.
>
> 3) Praneet could finish his project very quickly.
>
> Any thoughts about this?
>
> The problems that I see with this include:
>
> A) Weka is GPL which might slow adoption of Mahout and would certainly
> inhibit direct incorporation of any piece of Weka
>
> B) Weka appears to have not caught the maven bug which makes it harder to
> add as a dependency without actually distributing the weka jar.
>
> One possible work-around might be to reverse engineer something like
> Instance and an ARFF reader/writer.
