Besides, if you want to feed it directly to MR the way you would a
sequence file, you'd need some custom splitting and an InputFormat.
Since it is a text format, that should be fairly easy (much easier
than writing one for a sequence file, for example).
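
To illustrate why the text format helps, here is a minimal sketch of the
parsing side in Java. It is not Weka's or Mahout's code: the class name
ArffSketch and its Parsed holder are made up for illustration, and it
assumes numeric attributes only, unquoted attribute names, dense rows,
and no missing ('?') values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of a tiny ARFF reader; not an existing Weka or Mahout class.
public class ArffSketch {

    /** Attribute names in declaration order, plus the parsed data rows. */
    static final class Parsed {
        final List<String> attributeNames = new ArrayList<>();
        final List<double[]> rows = new ArrayList<>();
    }

    /** Parses header and data lines; assumes every attribute is numeric. */
    static Parsed parse(Iterable<String> lines) {
        Parsed out = new Parsed();
        boolean inData = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("%")) {
                continue; // skip blank lines and '%' comments
            }
            if (!inData) {
                // Header keywords are matched case-insensitively.
                String lower = line.toLowerCase(Locale.ROOT);
                if (lower.startsWith("@attribute")) {
                    // "@attribute <name> <type>": keep only the name here.
                    String[] parts = line.split("\\s+", 3);
                    if (parts.length >= 2) {
                        out.attributeNames.add(parts[1]);
                    }
                } else if (lower.startsWith("@data")) {
                    inData = true; // everything after this is one row per line
                }
                continue;
            }
            // Data section: one comma-separated row per line.
            String[] fields = line.split(",");
            double[] row = new double[fields.length];
            for (int i = 0; i < fields.length; i++) {
                row[i] = Double.parseDouble(fields[i].trim());
            }
            out.rows.add(row);
        }
        return out;
    }
}
```

Because each data row sits on its own line, a line-oriented splitter
could probably hand rows to mappers as they are. The header is the
awkward part: only the first split would contain it, so it would have to
reach the other splits some other way.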

-d

On Wed, Aug 24, 2011 at 3:35 PM, Dmitriy Lyubimov <[email protected]> wrote:
> somewhat -1 too. Just because :)
>
> As far as I understand, ARFF just contains a way to name attributes
> and present types other than double, which is why it is not DRM and
> DRM is not ARFF.
>
> I'd rather re-engineer the ARFF parser if need be.
>
>
>
> On Wed, Aug 24, 2011 at 3:16 PM, Jake Mannix <[email protected]> wrote:
>> My initial inclination is -1 on adding a GPL dependency.
>>
>> Can you spell out exactly what is meant by needing a "general input format"
>> and "general transfer format"?  We currently take in raw text, and then
>> vectorize it.  Are Vectors (with either hashed encoding, or with a
>> dictionary file) not suitable as a format for some reason?
>>
>>  -jake
>>
>> On Wed, Aug 24, 2011 at 3:09 PM, Ted Dunning <[email protected]> wrote:
>>
>>> Praneet and I were just talking about a project he is working on to do with
>>> higher-order learning methods such as boosting and feature sharding.  This
>>> is all pretty much in the context of classification and possibly
>>> clustering.
>>>
>>> The problems are:
>>>
>>> a) mahout doesn't have a general input format for classifiable data (this
>>> has been discussed recently)
>>>
>>> b) hashed vector representations are not suitable for feature sharding
>>> since individual features may be redundantly represented in many
>>> locations.
>>>
>>> c) mahout doesn't have a reasonable data structure for general data
>>> transfer (related to -a-)
>>>
>>> One possible thought is that Mahout could introduce Weka as a dependency.
>>>
>>> The virtues would be:
>>>
>>> 1) Weka has ARFF as a data format and Instance as an object to satisfy (a)
>>> and (c)
>>>
>>> 2) Weka provides a bunch of simple classifier algorithms which are not
>>> individually scalable, but might be made to be so by model averaging or
>>> feature sharding.
>>>
>>> 3) Praneet could finish his project very quickly.
>>>
>>> Any thoughts about this?
>>>
>>> The problems that I see with this include:
>>>
>>> A) Weka is GPL which might slow adoption of Mahout and would certainly
>>> inhibit direct incorporation of any piece of Weka
>>>
>>> B) Weka appears to have not caught the maven bug which makes it harder to
>>> add as a dependency without actually distributing the weka jar.
>>>
>>> One possible work-around might be to reverse engineer something like
>>> Instance and an ARFF reader/writer.
>>>
>>
>
