I can probably even volunteer to do that.
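
For a sense of scale, here is a rough sketch of what a re-engineered ARFF reader could look like. It is Python only for brevity, it is not Mahout or Weka code, and the names (`Attribute`, `parse_arff`) are made up for illustration. It assumes the dense ARFF layout only (`@relation`, `@attribute`, `@data`, `%` comments, `?` for missing values) and ignores quoted values, sparse rows and date attributes.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    """Hypothetical holder for one @attribute declaration."""
    name: str
    kind: str  # "numeric", "nominal", or "string"
    values: list = field(default_factory=list)  # allowed labels when nominal


def parse_arff(text):
    """Parse a small subset of ARFF; returns (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # blank line or comment
        if not in_data:
            parts = line.split(None, 1)
            keyword = parts[0].lower()  # header keywords are case-insensitive
            rest = parts[1].strip() if len(parts) > 1 else ""
            if keyword == "@relation":
                relation = rest
            elif keyword == "@attribute":
                name_and_spec = rest.split(None, 1)
                if len(name_and_spec) != 2:
                    raise ValueError("malformed attribute line: " + line)
                name, spec = name_and_spec[0], name_and_spec[1].strip()
                if spec.startswith("{") and spec.endswith("}"):
                    labels = [v.strip() for v in spec[1:-1].split(",")]
                    attributes.append(Attribute(name, "nominal", labels))
                elif spec.lower() in ("numeric", "real", "integer"):
                    attributes.append(Attribute(name, "numeric"))
                else:
                    # Everything else is kept as a plain string in this sketch.
                    attributes.append(Attribute(name, "string"))
            elif keyword == "@data":
                in_data = True
            continue
        # Data section: one comma-separated row per line.
        cells = [c.strip() for c in line.split(",")]
        if len(cells) != len(attributes):
            raise ValueError("wrong number of values in row: " + line)
        row = []
        for attr, cell in zip(attributes, cells):
            if cell == "?":
                row.append(None)  # missing value
            elif attr.kind == "numeric":
                row.append(float(cell))
            else:
                row.append(cell)
        rows.append(row)
    return relation, attributes, rows
```

Since every data row sits on its own line, the same per-line logic is what a custom splitter would lean on, provided each split also gets to see the header.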

On Wed, Aug 24, 2011 at 3:37 PM, Dmitriy Lyubimov <[email protected]> wrote:
> Besides, if you want to feed it directly to MR similar to sequence
> file, you'd need to do some custom splitting and an Input Format.
> Thanks to the fact it is a text format, it should be fairly easy (much
> easier than to write one for a sequence file, for example.)
>
> -d
>
> On Wed, Aug 24, 2011 at 3:35 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> somewhat -1 too. Just because :)
>>
>> as far as i understand, arff just contains a way to name attributes
>> and present types others than double, which is why it is not DRM and
>> DRM is not ARFF.
>>
>> I'd rather re-engineer ARFF parser if needs be.
>>
>> On Wed, Aug 24, 2011 at 3:16 PM, Jake Mannix <[email protected]> wrote:
>>> My initial inclination is -1 on adding a GPL dependency.
>>>
>>> Can you spell out exactly what is meant by needing a "general input format"
>>> and "general transfer format". We currently take in raw text, and then
>>> vectorize it. Are Vectors (with either hashed encoding, or with a
>>> dictionary file) not suitable as a format for some reason?
>>>
>>> -jake
>>>
>>> On Wed, Aug 24, 2011 at 3:09 PM, Ted Dunning <[email protected]> wrote:
>>>
>>>> Praneet and I were just talking about a project he is working on to do with
>>>> higher-order learning methods such as boosting and feature sharding. This
>>>> is all pretty much in the context of classification and possibly
>>>> clustering.
>>>>
>>>> The problems are:
>>>>
>>>> a) mahout doesn't have a general input format for classifiable data (this
>>>> has been discussed recently)
>>>>
>>>> b) hashed vector representations are not suitable for feature sharding
>>>> since individual features may be redundantly represented in many locations.
>>>>
>>>> c) mahout doesn't have a reasonable data structure for general data
>>>> transfer (related to -a-)
>>>>
>>>> One possible thought is that Mahout could introduce Weka as a dependency.
>>>>
>>>> The virtues would be:
>>>>
>>>> 1) Weka has ARFF as a data format and Instance as an object to satisfy (a)
>>>> and (c)
>>>>
>>>> 2) Weka provides a bunch of simple classifier algorithms which are not
>>>> individually scalable, but might be made to be so by model averaging or
>>>> feature sharding.
>>>>
>>>> 3) Praneet could finish his project very quickly.
>>>>
>>>> Any thoughts about this?
>>>>
>>>> The problems that I see with this include:
>>>>
>>>> A) Weka is GPL which might slow adoption of Mahout and would certainly
>>>> inhibit direct incorporation of any piece of Weka
>>>>
>>>> B) Weka appears to have not caught the maven bug which makes it harder to
>>>> add as a dependency without actually distributing the weka jar.
>>>>
>>>> One possible work-around might be to reverse engineer something like
>>>> Instance and an ARFF reader/writer.
>>>>
>>>
>>
>
