I can probably even volunteer to do that.

On Wed, Aug 24, 2011 at 3:37 PM, Dmitriy Lyubimov <[email protected]> wrote:
> Besides, if you want to feed it directly to MR the way you would a sequence
> file, you'd need to do some custom splitting and write an InputFormat.
> Thanks to the fact it is a text format, it should be fairly easy (much
> easier than to write one for a sequence file, for example.)
>
> -d
>
> On Wed, Aug 24, 2011 at 3:35 PM, Dmitriy Lyubimov <[email protected]> wrote:
>> somewhat -1 too. Just because :)
>>
>> As far as I understand, ARFF just contains a way to name attributes
>> and present types other than double, which is why it is not DRM and
>> DRM is not ARFF.
>>
>> I'd rather re-engineer an ARFF parser if need be.
>>
>>
>>
>> On Wed, Aug 24, 2011 at 3:16 PM, Jake Mannix <[email protected]> wrote:
>>> My initial inclination is -1 on adding a GPL dependency.
>>>
>>> Can you spell out exactly what is meant by needing a "general input format"
>>> and "general transfer format"?  We currently take in raw text, and then
>>> vectorize it.  Are Vectors (with either hashed encoding, or with a
>>> dictionary file) not suitable as a format for some reason?
>>>
>>>  -jake
>>>
>>> On Wed, Aug 24, 2011 at 3:09 PM, Ted Dunning <[email protected]> wrote:
>>>
>>>> Praneet and I were just talking about a project he is working on to do with
>>>> higher-order learning methods such as boosting and feature sharding.  This
>>>> is all pretty much in the context of classification and possibly clustering.
>>>>
>>>> The problems are:
>>>>
>>>> a) Mahout doesn't have a general input format for classifiable data (this
>>>> has been discussed recently)
>>>>
>>>> b) hashed vector representations are not suitable for feature sharding, since
>>>> individual features may be redundantly represented in many locations.
>>>>
>>>> c) Mahout doesn't have a reasonable data structure for general data transfer
>>>> (related to -a-)
>>>>
>>>> One possible thought is that Mahout could introduce Weka as a dependency.
>>>>
>>>> The virtues would be:
>>>>
>>>> 1) Weka has ARFF as a data format and Instance as an object to satisfy (a)
>>>> and (c)
>>>>
>>>> 2) Weka provides a bunch of simple classifier algorithms which are not
>>>> individually scalable, but might be made to be so by model averaging or
>>>> feature sharding.
>>>>
>>>> 3) Praneet could finish his project very quickly.
>>>>
>>>> Any thoughts about this?
>>>>
>>>> The problems that I see with this include:
>>>>
>>>> A) Weka is GPL, which might slow adoption of Mahout and would certainly
>>>> inhibit direct incorporation of any piece of Weka
>>>>
>>>> B) Weka appears not to have caught the Maven bug, which makes it harder to
>>>> add as a dependency without actually distributing the Weka jar.
>>>>
>>>> One possible work-around might be to reverse engineer something like
>>>> Instance and an ARFF reader/writer.
>>>>
>>>
>>
>
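
For readers following the thread: the "re-engineer an ARFF parser" idea is
small enough to sketch. The code below is only an illustration, not code from
Mahout or Weka. It assumes the dense, comma-separated ARFF layout (@relation,
@attribute, @data, "%" comments, "?" for a missing value) and ignores quoted
values, sparse rows, and date attributes. The names parse_arff and SAMPLE are
made up for the example.

```python
def parse_arff(text):
    """Parse a small subset of dense ARFF into (relation, attributes, rows).

    Assumptions (not a full ARFF implementation): no quoted values,
    no sparse "{index value}" rows, no date attributes.
    """
    relation = None
    attributes = []  # (name, kind); kind is "numeric", "string", or a list of nominal labels
    rows = []
    in_data = False

    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue

        if not in_data:
            lower = line.lower()
            if lower.startswith("@relation"):
                relation = line.split(None, 1)[1].strip()
            elif lower.startswith("@attribute"):
                _, name, spec = line.split(None, 2)
                spec = spec.strip()
                if spec.startswith("{") and spec.endswith("}"):
                    # Nominal attribute: remember the allowed labels.
                    kind = [label.strip() for label in spec[1:-1].split(",")]
                elif spec.lower() in ("numeric", "real", "integer"):
                    kind = "numeric"
                else:
                    # Anything else is kept as a raw string in this sketch.
                    kind = "string"
                attributes.append((name, kind))
            elif lower.startswith("@data"):
                in_data = True
            continue

        values = [v.strip() for v in line.split(",")]
        if len(values) != len(attributes):
            raise ValueError("expected %d values, got %d: %r"
                             % (len(attributes), len(values), line))

        row = []
        for value, (name, kind) in zip(values, attributes):
            if value == "?":
                row.append(None)  # missing value
            elif kind == "numeric":
                row.append(float(value))
            elif isinstance(kind, list):
                if value not in kind:
                    raise ValueError("%r is not a label of %s" % (value, name))
                row.append(value)
            else:
                row.append(value)
        rows.append(row)

    return relation, attributes, rows


SAMPLE = """\
% toy data set
@relation weather
@attribute temperature numeric
@attribute outlook {sunny, rainy}
@attribute play {yes, no}
@data
85, sunny, no
?, rainy, yes
"""

relation, attributes, rows = parse_arff(SAMPLE)
```

Because the header carries attribute names and non-double types, the parsed
rows keep nominal values as labels instead of forcing everything into a vector
of doubles, which is the distinction from DRM drawn earlier in the thread.
Since every data row sits on its own line, a line-oriented reader like this is
also the kind of thing a custom splitter could build on.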
