What is a "feature level representation" of the data? Can you be more specific?
On Wed, Aug 24, 2011 at 3:25 PM, praneet mhatre <[email protected]> wrote:

> The absence of Feature Level representation of data is what caused me to
> ask questions.
>
> In my project, I am trying to implement Feature Sharding, which can be very
> useful in Parallel Online Learning. Weka has the class 'Instances' to
> represent a data set. Using that class, adding / deleting attributes and
> hence horizontal or vertical splitting of the entire data set is very easy.
> I could not find a way to do that in Mahout.
>
>
> On Wed, Aug 24, 2011 at 3:16 PM, Jake Mannix <[email protected]> wrote:
>
>> My initial inclination is -1 on adding a GPL dependency.
>>
>> Can you spell out exactly what is meant by needing a "general input format"
>> and "general transfer format". We currently take in raw text, and then
>> vectorize it. Are Vectors (with either hashed encoding, or with a dictionary
>> file) not suitable as a format for some reason?
>>
>> -jake
>>
>>
>> On Wed, Aug 24, 2011 at 3:09 PM, Ted Dunning <[email protected]> wrote:
>>
>>> Praneet and I were just talking about a project he is working on to do with
>>> higher-order learning methods such as boosting and feature sharding. This
>>> is all pretty much in the context of classification and possibly
>>> clustering.
>>>
>>> The problems are:
>>>
>>> a) mahout doesn't have a general input format for classifiable data (this
>>> has been discussed recently)
>>>
>>> b) hashed vector representations are not suitable for feature sharding since
>>> individual features may be redundantly represented in many locations.
>>>
>>> c) mahout doesn't have a reasonable data structure for general data transfer
>>> (related to -a-)
>>>
>>> One possible thought is that Mahout could introduce Weka as a dependency.
>>>
>>> The virtues would be:
>>>
>>> 1) Weka has ARFF as a data format and Instance as an object to satisfy (a)
>>> and (c)
>>>
>>> 2) Weka provides a bunch of simple classifier algorithms which are not
>>> individually scalable, but might be made to be so by model averaging or
>>> feature sharding.
>>>
>>> 3) Praneet could finish his project very quickly.
>>>
>>> Any thoughts about this?
>>>
>>> The problems that I see with this include:
>>>
>>> A) Weka is GPL which might slow adoption of Mahout and would certainly
>>> inhibit direct incorporation of any piece of Weka
>>>
>>> B) Weka appears to have not caught the maven bug which makes it harder to
>>> add as a dependency without actually distributing the weka jar.
>>>
>>> One possible work-around might be to reverse engineer something like
>>> Instance and an ARFF reader/writer.
>>>
>>
>>
>
>
> --
> Praneet Mhatre
> Graduate Student
> Donald Bren School of ICS
> University of California, Irvine
>
>
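
For readers following the thread, here is a rough sketch of what a vertical (feature-wise) split of a data set might look like. It uses plain Java arrays only and is not Mahout or Weka API. The class name `FeatureSharding`, the method `shardColumns`, and the round-robin assignment of columns to shards are all assumptions made for illustration. The point is that each feature lands in exactly one shard, which is the property item (b) above says hashed vectors lack.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Mahout or Weka.
class FeatureSharding {

    // Split a dense, row-major data set vertically. Shard s receives every
    // column j with j % numShards == s, so each feature appears in exactly
    // one shard. The round-robin rule is an assumption, not a prescribed scheme.
    static double[][][] shardColumns(double[][] rows, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        int numFeatures = rows.length == 0 ? 0 : rows[0].length;
        double[][][] shards = new double[numShards][rows.length][];
        for (int s = 0; s < numShards; s++) {
            // Count of columns j in [0, numFeatures) with j % numShards == s.
            int width = (numFeatures - s + numShards - 1) / numShards;
            for (int i = 0; i < rows.length; i++) {
                if (rows[i].length != numFeatures) {
                    throw new IllegalArgumentException("rows must have equal length");
                }
                shards[s][i] = new double[width];
                for (int j = s, c = 0; j < numFeatures; j += numShards, c++) {
                    shards[s][i][c] = rows[i][j];
                }
            }
        }
        return shards;
    }

    public static void main(String[] args) {
        double[][] rows = {
            {1, 2, 3, 4, 5},
            {6, 7, 8, 9, 10}
        };
        double[][][] shards = shardColumns(rows, 2);
        for (int s = 0; s < shards.length; s++) {
            System.out.println("shard " + s + ": " + Arrays.deepToString(shards[s]));
        }
    }
}
```

Every shard keeps all rows but only a subset of the columns, so a learner trained on one shard sees a fixed, non-overlapping set of features.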
