@Jake: I mean a generalized representation of a data set in terms of its attributes and class label, i.e. before any form of encoding is applied to the data. Something that can sit between CSV/ARFF etc. and Vector. Something like this: http://nlp.stanford.edu/nlp/javadoc/weka-3-2/weka.core.Instances.html ?
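
To make that a bit more concrete, here is a rough sketch of the kind of structure I have in mind: named, typed attributes, raw unencoded rows, and one attribute marked as the class label. It is purely illustrative. DataSetSketch, Kind, Attribute, Instance and DataSet are hypothetical names that exist in neither Mahout nor Weka, and the shape of the API is an assumption, not a proposal for a final design.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch only; none of these types exist in Mahout or Weka. */
final class DataSetSketch {

  /** Attribute kinds, loosely modeled on what an ARFF header can declare. */
  enum Kind { NUMERIC, NOMINAL, STRING }

  /** A named, typed column; nominal attributes carry their allowed labels. */
  static final class Attribute {
    final String name;
    final Kind kind;
    final List<String> labels;

    Attribute(String name, Kind kind, String... labels) {
      this.name = name;
      this.kind = kind;
      this.labels = Collections.unmodifiableList(Arrays.asList(labels));
    }
  }

  /** One row of raw, unencoded values, positionally aligned with the schema. */
  static final class Instance {
    final Object[] values;

    Instance(Object... values) {
      this.values = values.clone();
    }
  }

  /** Schema plus rows, with one attribute designated as the class label. */
  static final class DataSet {
    final List<Attribute> attributes;
    final int classIndex;
    final List<Instance> rows = new ArrayList<>();

    DataSet(List<Attribute> attributes, int classIndex) {
      if (classIndex < 0 || classIndex >= attributes.size()) {
        throw new IllegalArgumentException("classIndex out of range: " + classIndex);
      }
      this.attributes = Collections.unmodifiableList(new ArrayList<>(attributes));
      this.classIndex = classIndex;
    }

    /** Adds a row after checking that it has one value per attribute. */
    void add(Instance row) {
      if (row.values.length != attributes.size()) {
        throw new IllegalArgumentException("row width does not match schema");
      }
      rows.add(row);
    }

    /** Returns the raw class label of a row. */
    Object classLabel(Instance row) {
      return row.values[classIndex];
    }

    /** Looks up a raw value by attribute name rather than by hashed position. */
    Object value(Instance row, String attributeName) {
      for (int i = 0; i < attributes.size(); i++) {
        if (attributes.get(i).name.equals(attributeName)) {
          return row.values[i];
        }
      }
      throw new IllegalArgumentException("unknown attribute: " + attributeName);
    }
  }
}
```

Since every feature keeps its own name and a single position in the row, a structure like this could be split by attribute for feature sharding and only encoded into a Vector afterwards.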

On Wed, Aug 24, 2011 at 3:35 PM, Dmitriy Lyubimov <[email protected]> wrote:

> somewhat -1 too. Just because :)
>
> as far as i understand, arff just contains a way to name attributes
> and present types others than double, which is why it is not DRM and
> DRM is not ARFF.
>
> I'd rather re-engineer ARFF parser if needs be.
>
> On Wed, Aug 24, 2011 at 3:16 PM, Jake Mannix <[email protected]> wrote:
>
> > My initial inclination is -1 on adding a GPL dependency.
> >
> > Can you spell out exactly what is meant by needing a "general input format"
> > and "general transfer format". We currently take in raw text, and then
> > vectorize it. Are Vectors (with either hashed encoding, or with a dictionary
> > file) not suitable as a format for some reason?
> >
> > -jake
> >
> > On Wed, Aug 24, 2011 at 3:09 PM, Ted Dunning <[email protected]> wrote:
> >
> >> Praneet and I were just talking about a project he is working on to do with
> >> higher-order learning methods such as boosting and feature sharding. This
> >> is all pretty much in the context of classification and possibly
> >> clustering.
> >>
> >> The problems are:
> >>
> >> a) mahout doesn't have a general input format for classifiable data (this
> >> has been discussed recently)
> >>
> >> b) hashed vector representations are not suitable for feature sharding since
> >> individual features may be redundantly represented in many locations.
> >>
> >> c) mahout doesn't have a reasonable data structure for general data transfer
> >> (related to -a-)
> >>
> >> One possible thought is that Mahout could introduce Weka as a dependency.
> >>
> >> The virtues would be:
> >>
> >> 1) Weka has ARFF as a data format and Instance as an object to satisfy (a)
> >> and (c)
> >>
> >> 2) Weka provides a bunch of simple classifier algorithms which are not
> >> individually scalable, but might be made to be so by model averaging or
> >> feature sharding.
> >>
> >> 3) Praneet could finish his project very quickly.
> >>
> >> Any thoughts about this?
> >>
> >> The problems that I see with this include:
> >>
> >> A) Weka is GPL which might slow adoption of Mahout and would certainly
> >> inhibit direct incorporation of any piece of Weka
> >>
> >> B) Weka appears to have not caught the maven bug which makes it harder to
> >> add as a dependency without actually distributing the weka jar.
> >>
> >> One possible work-around might be to reverse engineer something like
> >> Instance and an ARFF reader/writer.
> >>
> >
>

--
Praneet Mhatre
Graduate Student
Donald Bren School of ICS
University of California, Irvine
