GPL as part of our code base is pretty much a non-starter. You might be able to come up with some workarounds, but see http://www.apache.org/legal/3party.html, especially the exception options.
Also, we already have a basic ARFF reader in the integration module. It would be great if someone with more examples and more familiarity with ARFF were to take it up a notch.

On Aug 24, 2011, at 3:09 PM, Ted Dunning wrote:

> Praneet and I were just talking about a project he is working on to do with
> higher-order learning methods such as boosting and feature sharding. This
> is all pretty much in the context of classification and possibly clustering.
>
> The problems are:
>
> a) mahout doesn't have a general input format for classifiable data (this
> has been discussed recently)
>
> b) hashed vector representations are not suitable for feature sharding since
> individual features may be redundantly represented in many locations.
>
> c) mahout doesn't have a reasonable data structure for general data transfer
> (related to -a-)
>
> One possible thought is that Mahout could introduce Weka as a dependency.
>
> The virtues would be:
>
> 1) Weka has ARFF as a data format and Instance as an object to satisfy (a)
> and (c)
>
> 2) Weka provides a bunch of simple classifier algorithms which are not
> individually scalable, but might be made to be so by model averaging or
> feature sharding.
>
> 3) Praneet could finish his project very quickly.
>
> Any thoughts about this?
>
> The problems that I see with this include:
>
> A) Weka is GPL which might slow adoption of Mahout and would certainly
> inhibit direct incorporation of any piece of Weka
>
> B) Weka appears to have not caught the maven bug which makes it harder to
> add as a dependency without actually distributing the weka jar.
>
> One possible work-around might be to reverse engineer something like
> Instance and an ARFF reader/writer.

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
