My initial inclination is -1 on adding a GPL dependency.

Can you spell out exactly what is meant by needing a "general input format"
and a "general transfer format"?  We currently take in raw text and then
vectorize it.  Are Vectors (with either hashed encoding or a dictionary
file) not suitable as a format for some reason?
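
As a rough illustration of the two encodings mentioned above, here is a
sketch in Python. It is not Mahout's actual API: the function names, the
MD5-based hash, and the probes parameter are all assumptions made for the
example. It only shows that a dictionary gives each term a single index,
while hashing with several probes can put the same term in more than one
slot.

```python
import hashlib


def dictionary_encode(tokens, dictionary):
    """Dictionary encoding: each known term maps to exactly one index."""
    vec = [0.0] * len(dictionary)
    for token in tokens:
        if token in dictionary:
            vec[dictionary[token]] += 1.0
    return vec


def hashed_encode(tokens, size, probes=2):
    """Hashed encoding (hypothetical scheme): each term is added at
    `probes` hashed positions, so one feature may occupy several slots
    and may collide with other features."""
    vec = [0.0] * size
    for token in tokens:
        for probe in range(probes):
            digest = hashlib.md5(f"{probe}:{token}".encode("utf-8")).hexdigest()
            vec[int(digest, 16) % size] += 1.0
    return vec
```

With the dictionary, a term can be traced back to its one index; with the
hashed form, the same term may contribute to several positions, which is
the redundancy that point (b) in the quoted message is about.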

  -jake

On Wed, Aug 24, 2011 at 3:09 PM, Ted Dunning <[email protected]> wrote:

> Praneet and I were just talking about a project he is working on to do with
> higher-order learning methods such as boosting and feature sharding.  This
> is all pretty much in the context of classification and possibly
> clustering.
>
> The problems are:
>
> a) mahout doesn't have a general input format for classifiable data (this
> has been discussed recently)
>
> b) hashed vector representations are not suitable for feature sharding
> since individual features may be redundantly represented in many locations.
>
> c) mahout doesn't have a reasonable data structure for general data
> transfer (related to -a-)
>
> One possible thought is that Mahout could introduce Weka as a dependency.
>
> The virtues would be:
>
> 1) Weka has ARFF as a data format and Instance as an object to satisfy (a)
> and (c)
>
> 2) Weka provides a bunch of simple classifier algorithms which are not
> individually scalable, but might be made to be so by model averaging or
> feature sharding.
>
> 3) Praneet could finish his project very quickly.
>
> Any thoughts about this?
>
> The problems that I see with this include:
>
> A) Weka is GPL which might slow adoption of Mahout and would certainly
> inhibit direct incorporation of any piece of Weka
>
> B) Weka appears to have not caught the maven bug which makes it harder to
> add as a dependency without actually distributing the weka jar.
>
> One possible work-around might be to reverse engineer something like
> Instance and an ARFF reader/writer.
>
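
To give a sense of the size of the work-around suggested at the end of the
quoted message, here is a minimal sketch of a reader for a small subset of
ARFF, written in Python from the public format description and not taken
from Weka. The function name parse_arff and the dict-per-row layout are
assumptions for illustration. Sparse rows, quoted values, and date or
string attributes are not handled.

```python
def parse_arff(text):
    """Parse a small ARFF subset: @relation, @attribute, @data, dense rows.

    Returns (relation_name, [(attr_name, attr_type), ...], [row_dict, ...]).
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and % comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                if value == "?":  # ARFF marks missing values with "?"
                    row[name] = None
                elif kind == "numeric":
                    row[name] = float(value)
                else:
                    row[name] = value
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            if kind.lower() in ("numeric", "real", "integer"):
                kind = "numeric"
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows
```

Each parsed row here plays the role of a very stripped-down Instance: a
mapping from named attributes to values, with None for missing entries.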
