Re: discussion of input conversions

Dmitriy Lyubimov Wed, 24 Aug 2011 15:37:54 -0700

Besides, if you want to feed it directly to MR the way you would a
sequence file, you'd need to do some custom splitting and write an
InputFormat. Since it is a text format, that should be fairly easy
(much easier than writing one for a sequence file, for example).
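
To illustrate why a text format helps here, below is a rough sketch of a
dense-row ARFF reader. It is not Mahout or Weka code: the function names
(parse_header, parse_row) are hypothetical, and it assumes no quoted
values, no sparse rows, and no date attributes. The point is only that
once the header has been read, each data line can be parsed on its own,
which is what makes line-based splitting workable.

```python
# Hypothetical sketch of a minimal ARFF reader (dense rows only).
# Assumptions: no quoted values, no sparse "{index value}" rows, no date type.

NUMERIC_TYPES = ("numeric", "real", "integer")


def parse_header(lines):
    """Return (relation, [(name, kind), ...]) read from the ARFF header.

    kind is a lowercase type name, or a list of labels for nominal attributes.
    """
    relation = None
    attributes = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # nominal attribute: keep the list of allowed labels
                labels = [v.strip() for v in spec.strip("{}").split(",")]
                attributes.append((name, labels))
            else:
                attributes.append((name, spec.lower()))
        elif lower.startswith("@data"):
            break  # header ends here; data rows follow
    return relation, attributes


def parse_row(line, attributes):
    """Parse one dense data line using the attribute list from the header."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != len(attributes):
        raise ValueError("expected %d fields, got %d" % (len(attributes), len(fields)))
    row = []
    for field, (name, kind) in zip(fields, attributes):
        if field == "?":
            row.append(None)  # ARFF marks missing values with "?"
        elif isinstance(kind, list):
            if field not in kind:
                raise ValueError("unknown label %r for attribute %s" % (field, name))
            row.append(field)
        elif kind in NUMERIC_TYPES:
            row.append(float(field))
        else:
            row.append(field)  # string attributes are kept as-is
    return row
```

In an MR setting, each task would need the header from the start of the
file (or from a side file) and could then call something like parse_row
on the lines of its own split.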


-d

On Wed, Aug 24, 2011 at 3:35 PM, Dmitriy Lyubimov <[email protected]> wrote:
> somewhat -1 too. Just because :)
>
> as far as I understand, ARFF just contains a way to name attributes
> and present types other than double, which is why it is not DRM and
> DRM is not ARFF.
>
> I'd rather re-engineer an ARFF parser if need be.
>
>
>
> On Wed, Aug 24, 2011 at 3:16 PM, Jake Mannix <[email protected]> wrote:
>> My initial inclination is -1 on adding a GPL dependency.
>>
>> Can you spell out exactly what is meant by needing a "general input format"
>> and "general transfer format"?  We currently take in raw text, and then
>> vectorize it.   Are Vectors (with either hashed encoding, or with a
>> dictionary file) not suitable as a format for some reason?
>>
>>  -jake
>>
>> On Wed, Aug 24, 2011 at 3:09 PM, Ted Dunning <[email protected]> wrote:
>>
>>> Praneet and I were just talking about a project he is working on to do with
>>> higher-order learning methods such as boosting and feature sharding.  This
>>> is all pretty much in the context of classification and possibly clustering.
>>>
>>> The problems are:
>>>
>>> a) Mahout doesn't have a general input format for classifiable data (this
>>> has been discussed recently)
>>>
>>> b) hashed vector representations are not suitable for feature sharding
>>> since individual features may be redundantly represented in many locations.
>>>
>>> c) Mahout doesn't have a reasonable data structure for general data
>>> transfer (related to -a-)
>>>
>>> One possible thought is that Mahout could introduce Weka as a dependency.
>>>
>>> The virtues would be:
>>>
>>> 1) Weka has ARFF as a data format and Instance as an object to satisfy (a)
>>> and (c)
>>>
>>> 2) Weka provides a bunch of simple classifier algorithms which are not
>>> individually scalable, but might be made to be so by model averaging or
>>> feature sharding.
>>>
>>> 3) Praneet could finish his project very quickly.
>>>
>>> Any thoughts about this?
>>>
>>> The problems that I see with this include:
>>>
>>> A) Weka is GPL, which might slow adoption of Mahout and would certainly
>>> inhibit direct incorporation of any piece of Weka
>>>
>>> B) Weka appears to have not caught the Maven bug, which makes it harder to
>>> add as a dependency without actually distributing the Weka jar.
>>>
>>> One possible work-around might be to reverse engineer something like
>>> Instance and an ARFF reader/writer.
>>>
>>
>
