Re: discussion of input conversions

Jake Mannix Wed, 24 Aug 2011 15:33:34 -0700

What is a "feature level representation" of the data?  Can you be more
specific?


On Wed, Aug 24, 2011 at 3:25 PM, praneet mhatre <[email protected]> wrote:

> The absence of a feature-level representation of the data is what prompted
> my questions.
>
> In my project, I am trying to implement feature sharding, which can be very
> useful in parallel online learning. Weka has the class 'Instances' to
> represent a data set. Using that class, adding or deleting attributes, and
> hence horizontally or vertically splitting the entire data set, is very easy.
> I could not find a way to do that in Mahout.
>
>
> On Wed, Aug 24, 2011 at 3:16 PM, Jake Mannix <[email protected]> wrote:
>
>> My initial inclination is -1 on adding a GPL dependency.
>>
>> Can you spell out exactly what is meant by needing a "general input
>> format" and a "general transfer format"?  We currently take in raw text
>> and then vectorize it.  Are Vectors (with either hashed encoding or a
>> dictionary file) not suitable as a format for some reason?
>>
>>   -jake
>>
>>
>> On Wed, Aug 24, 2011 at 3:09 PM, Ted Dunning <[email protected]> wrote:
>>
>>> Praneet and I were just talking about a project he is working on that
>>> involves higher-order learning methods such as boosting and feature
>>> sharding.  This is all pretty much in the context of classification and
>>> possibly clustering.
>>>
>>> The problems are:
>>>
>>> a) Mahout doesn't have a general input format for classifiable data (this
>>> has been discussed recently)
>>>
>>> b) hashed vector representations are not suitable for feature sharding,
>>> since individual features may be redundantly represented in many locations.
>>>
>>> c) Mahout doesn't have a reasonable data structure for general data
>>> transfer (related to (a))
>>>
>>> One possible thought is that Mahout could introduce Weka as a dependency.
>>>
>>> The virtues would be:
>>>
>>> 1) Weka has ARFF as a data format and Instance as an object to satisfy
>>> (a) and (c)
>>>
>>> 2) Weka provides a bunch of simple classifier algorithms which are not
>>> individually scalable, but might be made to be so by model averaging or
>>> feature sharding.
>>>
>>> 3) Praneet could finish his project very quickly.
>>>
>>> Any thoughts about this?
>>>
>>> The problems that I see with this include:
>>>
>>> A) Weka is GPL, which might slow adoption of Mahout and would certainly
>>> inhibit direct incorporation of any piece of Weka
>>>
>>> B) Weka appears not to have caught the Maven bug, which makes it harder to
>>> add as a dependency without actually distributing the Weka jar.
>>>
>>> One possible work-around might be to reverse engineer something like
>>> Instance and an ARFF reader/writer.
>>>
>>
>>
>
>
> --
> Praneet Mhatre
> Graduate Student
> Donald Bren School of ICS
> University of California, Irvine
>
>
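
To make point (b) and the "vertical splitting" request concrete, here is a
minimal sketch in Python. It is not Mahout or Weka code. The names
hashed_indices, build_dictionary, and shard_by_feature are hypothetical, and
the multi-probe hashing scheme is an assumption chosen only to illustrate how
one feature can land in several vector positions. With a dictionary, each
named feature owns exactly one column, so a vertical split assigns every
feature to exactly one shard.

```python
import hashlib


def hashed_indices(feature, num_buckets, probes=2):
    """Map a feature name to `probes` bucket indices.

    Hypothetical scheme for illustration only: with more than one probe,
    a single feature is written to several vector positions.
    """
    indices = []
    for probe in range(probes):
        digest = hashlib.md5(f"{feature}#{probe}".encode("utf-8")).digest()
        indices.append(int.from_bytes(digest[:4], "big") % num_buckets)
    return indices


def build_dictionary(rows):
    """Assign each named feature exactly one column index."""
    names = sorted({name for row in rows for name in row})
    return {name: index for index, name in enumerate(names)}


def shard_by_feature(rows, dictionary, num_shards):
    """Vertical split: shard s receives the features whose
    column index modulo num_shards equals s."""
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        parts = [{} for _ in range(num_shards)]
        for name, value in row.items():
            parts[dictionary[name] % num_shards][name] = value
        for s in range(num_shards):
            shards[s].append(parts[s])
    return shards


rows = [
    {"age": 31.0, "height": 1.8, "weight": 75.0, "income": 52000.0},
    {"age": 45.0, "height": 1.6, "income": 61000.0},
]
dictionary = build_dictionary(rows)
shards = shard_by_feature(rows, dictionary, num_shards=2)
```

Each shard keeps one partial row per original row, so a learner working on
shard s sees only its own features but can still line its rows up with the
labels. A hashed vector offers no such guarantee: two different features may
share a bucket, and with several probes one feature occupies several buckets
that a split by index range could send to different shards.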
