Re: discussion of input conversions

praneet mhatre Wed, 24 Aug 2011 15:43:35 -0700

@Jake
A generalized representation of a data set in terms of its attributes and
class label, i.e. before applying any form of encoding on top of the data.
Something that can sit between CSV/ARFF etc. and Vector. Something like this:
http://nlp.stanford.edu/nlp/javadoc/weka-3-2/weka.core.Instances.html ?
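
For illustration only, a minimal sketch of such an intermediate
representation might look like the following. All names here (Attribute,
Instance, DataSet) are hypothetical; they are not Mahout or Weka classes,
and the sketch only assumes numeric and nominal attribute types plus a
nominal class label.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical attribute description: a name plus a numeric or nominal type. */
final class Attribute {
  enum Type { NUMERIC, NOMINAL }

  final String name;
  final Type type;
  final List<String> nominalValues; // empty for NUMERIC attributes

  private Attribute(String name, Type type, List<String> nominalValues) {
    this.name = name;
    this.type = type;
    this.nominalValues = Collections.unmodifiableList(new ArrayList<>(nominalValues));
  }

  static Attribute numeric(String name) {
    return new Attribute(name, Type.NUMERIC, Collections.<String>emptyList());
  }

  static Attribute nominal(String name, String... values) {
    return new Attribute(name, Type.NOMINAL, Arrays.asList(values));
  }
}

/** One row of raw values, aligned with the attribute list; no encoding applied. */
final class Instance {
  final Object[] values; // Number for NUMERIC, String for NOMINAL

  Instance(Object[] values) {
    this.values = values.clone();
  }
}

/** Hypothetical data set: named, typed attributes, a class index, and rows. */
final class DataSet {
  final List<Attribute> attributes;
  final int classIndex;
  private final List<Instance> instances = new ArrayList<>();

  DataSet(List<Attribute> attributes, int classIndex) {
    if (classIndex < 0 || classIndex >= attributes.size()) {
      throw new IllegalArgumentException("class index out of range: " + classIndex);
    }
    if (attributes.get(classIndex).type != Attribute.Type.NOMINAL) {
      throw new IllegalArgumentException("class attribute must be nominal");
    }
    this.attributes = Collections.unmodifiableList(new ArrayList<>(attributes));
    this.classIndex = classIndex;
  }

  /** Adds a row after checking each value against its attribute's declared type. */
  void add(Object... values) {
    if (values.length != attributes.size()) {
      throw new IllegalArgumentException("expected " + attributes.size() + " values");
    }
    for (int i = 0; i < values.length; i++) {
      Attribute a = attributes.get(i);
      boolean ok = a.type == Attribute.Type.NUMERIC
          ? values[i] instanceof Number
          : values[i] instanceof String && a.nominalValues.contains(values[i]);
      if (!ok) {
        throw new IllegalArgumentException("bad value for attribute " + a.name);
      }
    }
    instances.add(new Instance(values));
  }

  int size() {
    return instances.size();
  }

  /** Returns the raw (unencoded) class label of the given row. */
  String classLabel(int row) {
    return (String) instances.get(row).values[classIndex];
  }
}
```

A CSV or ARFF reader could populate a structure like this, and a separate
step could then turn each row into a Vector, so the choice of encoding
(hashed or dictionary-based) stays out of the input format.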


On Wed, Aug 24, 2011 at 3:35 PM, Dmitriy Lyubimov <[email protected]> wrote:

> somewhat -1 too. Just because :)
>
> as far as i understand, arff just contains a way to name attributes
> and present types other than double, which is why it is not DRM and
> DRM is not ARFF.
>
> I'd rather re-engineer an ARFF parser if need be.
>
>
>
> On Wed, Aug 24, 2011 at 3:16 PM, Jake Mannix <[email protected]> wrote:
> > My initial inclination is -1 on adding a GPL dependency.
> >
> > Can you spell out exactly what is meant by needing a "general input format"
> > and "general transfer format"? We currently take in raw text, and then
> > vectorize it. Are Vectors (with either hashed encoding, or with a
> > dictionary file) not suitable as a format for some reason?
> >
> >  -jake
> >
> > On Wed, Aug 24, 2011 at 3:09 PM, Ted Dunning <[email protected]> wrote:
> >
> >> Praneet and I were just talking about a project he is working on to do with
> >> higher-order learning methods such as boosting and feature sharding. This
> >> is all pretty much in the context of classification and possibly
> >> clustering.
> >>
> >> The problems are:
> >>
> >> a) mahout doesn't have a general input format for classifiable data (this
> >> has been discussed recently)
> >>
> >> b) hashed vector representations are not suitable for feature sharding
> >> since individual features may be redundantly represented in many locations.
> >>
> >> c) mahout doesn't have a reasonable data structure for general data
> >> transfer (related to -a-)
> >>
> >> One possible thought is that Mahout could introduce Weka as a dependency.
> >>
> >> The virtues would be:
> >>
> >> 1) Weka has ARFF as a data format and Instance as an object to satisfy (a)
> >> and (c)
> >>
> >> 2) Weka provides a bunch of simple classifier algorithms which are not
> >> individually scalable, but might be made to be so by model averaging or
> >> feature sharding.
> >>
> >> 3) Praneet could finish his project very quickly.
> >>
> >> Any thoughts about this?
> >>
> >> The problems that I see with this include:
> >>
> >> A) Weka is GPL which might slow adoption of Mahout and would certainly
> >> inhibit direct incorporation of any piece of Weka
> >>
> >> B) Weka appears to have not caught the maven bug which makes it harder to
> >> add as a dependency without actually distributing the weka jar.
> >>
> >> One possible work-around might be to reverse engineer something like
> >> Instance and an ARFF reader/writer.
> >>
> >
>



-- 
Praneet Mhatre
Graduate Student
Donald Bren School of ICS
University of California, Irvine
