The lack of a feature-level representation of data is what prompted my questions.
In my project, I am trying to implement Feature Sharding, which can be very useful in Parallel Online Learning. Weka has the class 'Instances' to represent a data set. Using that class, adding / deleting attributes and hence horizontal or vertical splitting of the entire data set is very easy. I could not find a way to do that in Mahout.

On Wed, Aug 24, 2011 at 3:16 PM, Jake Mannix <[email protected]> wrote:

> My initial inclination is -1 on adding a GPL dependency.
>
> Can you spell out exactly what is meant by needing a "general input format"
> and "general transfer format". We currently take in raw text, and then
> vectorize it. Are Vectors (with either hashed encoding, or with a dictionary
> file) not suitable as a format for some reason?
>
> -jake
>
> On Wed, Aug 24, 2011 at 3:09 PM, Ted Dunning <[email protected]> wrote:
>
>> Praneet and I were just talking about a project he is working on to do with
>> higher-order learning methods such as boosting and feature sharding. This
>> is all pretty much in the context of classification and possibly clustering.
>>
>> The problems are:
>>
>> a) mahout doesn't have a general input format for classifiable data (this
>> has been discussed recently)
>>
>> b) hashed vector representations are not suitable for feature sharding since
>> individual features may be redundantly represented in many locations.
>>
>> c) mahout doesn't have a reasonable data structure for general data transfer
>> (related to -a-)
>>
>> One possible thought is that Mahout could introduce Weka as a dependency.
>>
>> The virtues would be:
>>
>> 1) Weka has ARFF as a data format and Instance as an object to satisfy (a)
>> and (c)
>>
>> 2) Weka provides a bunch of simple classifier algorithms which are not
>> individually scalable, but might be made to be so by model averaging or
>> feature sharding.
>>
>> 3) Praneet could finish his project very quickly.
>>
>> Any thoughts about this?
>>
>> The problems that I see with this include:
>>
>> A) Weka is GPL which might slow adoption of Mahout and would certainly
>> inhibit direct incorporation of any piece of Weka
>>
>> B) Weka appears to have not caught the maven bug which makes it harder to
>> add as a dependency without actually distributing the weka jar.
>>
>> One possible work-around might be to reverse engineer something like
>> Instance and an ARFF reader/writer.
>>
>

--
Praneet Mhatre
Graduate Student
Donald Bren School of ICS
University of California, Irvine
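
For illustration only, the vertical (feature-wise) and horizontal (instance-wise) splits discussed in this thread can be sketched in a few lines of plain Python. This is a minimal sketch, not Weka's or Mahout's API: the `Dataset` class and its `select`, `vertical_split`, and `horizontal_split` methods are hypothetical names, and the round-robin assignment of features to shards is just one assumed policy. The point it shows is that with a named-column representation each feature lands in exactly one shard, which is the property a hashed encoding does not guarantee.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Dataset:
    """Hypothetical stand-in for a dense data set with named feature columns."""
    attributes: List[str]      # one name per feature column
    rows: List[List[float]]    # one list of feature values per instance

    def select(self, columns: List[int]) -> "Dataset":
        """Keep only the given feature columns, in the given order."""
        return Dataset(
            [self.attributes[c] for c in columns],
            [[row[c] for c in columns] for row in self.rows],
        )

    def vertical_split(self, num_shards: int) -> List["Dataset"]:
        """Feature sharding: assign each feature to exactly one shard (round-robin)."""
        n = len(self.attributes)
        return [self.select(list(range(s, n, num_shards))) for s in range(num_shards)]

    def horizontal_split(self, num_parts: int) -> List["Dataset"]:
        """Instance partitioning: each part keeps all features but only some rows."""
        return [
            Dataset(list(self.attributes), self.rows[p::num_parts])
            for p in range(num_parts)
        ]


# Made-up example data: 3 instances, 4 features.
data = Dataset(
    attributes=["f0", "f1", "f2", "f3"],
    rows=[[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]],
)
feature_shards = data.vertical_split(2)   # shards hold {f0, f2} and {f1, f3}
row_parts = data.horizontal_split(2)      # parts hold rows {0, 2} and {1}
```

Each shard produced by `vertical_split` could then be handed to a separate learner, with the per-shard predictions combined afterwards; how that combination is done is outside the scope of this sketch.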
