discussion of input conversions

Ted Dunning Wed, 24 Aug 2011 15:10:09 -0700

Praneet and I were just talking about a project he is working on to do with
higher-order learning methods such as boosting and feature sharding.  This
is all pretty much in the context of classification and possibly clustering.


The problems are:

a) Mahout doesn't have a general input format for classifiable data (this
has been discussed recently)

b) hashed vector representations are not suitable for feature sharding since
individual features may be redundantly represented in many locations.

c) Mahout doesn't have a reasonable data structure for general data transfer
(related to (a))
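
To make (b) concrete, here is a minimal sketch of how a hashed encoding can
scatter one named feature over several vector slots. It is only an
illustration under assumptions: the class HashedEncodingSketch, its methods,
and the use of String.hashCode are all made up for this example and are not
Mahout's encoder API. With more than one probe per feature, the slots for a
single feature can land in different index ranges, so sharding the vector by
index does not shard by feature.

```java
import java.util.Arrays;

/** Hypothetical sketch of hashed feature encoding; not Mahout's encoder API. */
final class HashedEncodingSketch {
    /** Returns the vector slots that a named feature is hashed into, one per probe. */
    static int[] slots(String feature, int probes, int vectorSize) {
        int[] out = new int[probes];
        for (int p = 0; p < probes; p++) {
            // Mix the probe number into the hashed string so probes can land in different slots.
            int h = (feature + "#" + p).hashCode();
            out[p] = Math.floorMod(h, vectorSize);
        }
        return out;
    }

    /** Maps a slot to a shard by cutting the index range into equal contiguous blocks. */
    static int shardOf(int slot, int vectorSize, int shards) {
        return (int) ((long) slot * shards / vectorSize);
    }

    public static void main(String[] args) {
        int vectorSize = 1000;
        int shards = 4;
        for (String f : new String[] {"age", "income", "zip=94040"}) {
            int[] s = slots(f, 2, vectorSize);
            int[] sh = Arrays.stream(s).map(x -> shardOf(x, vectorSize, shards)).toArray();
            System.out.println(f + " -> slots " + Arrays.toString(s)
                + ", shards " + Arrays.toString(sh));
        }
    }
}
```

Running main prints, for each feature, the slots it occupies and the shard
each slot would fall in under a simple index-range split.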

One possible thought is that Mahout could introduce Weka as a dependency.

The virtues would be:

1) Weka has ARFF as a data format and Instance as an object to satisfy (a)
and (c)

2) Weka provides a bunch of simple classifier algorithms which are not
individually scalable, but might be made to be so by model averaging or
feature sharding.

3) Praneet could finish his project very quickly.

Any thoughts about this?

The problems that I see with this include:

A) Weka is GPL, which might slow adoption of Mahout and would certainly
inhibit direct incorporation of any piece of Weka

B) Weka appears not to have caught the Maven bug, which makes it harder to
add as a dependency without actually distributing the Weka jar.

One possible work-around might be to reverse engineer something like
Instance and an ARFF reader/writer.
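
As a rough idea of how small such a reader could start out, here is a sketch
that parses only the dense subset of ARFF (@relation, @attribute, @data, and
% comment lines) from a String. Everything in it is an assumption for
illustration: the names ArffSketch and Row are hypothetical, the code is
written from the file format alone rather than from Weka's sources, and it
ignores attribute types, quoted values, missing values, and sparse rows.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a dense-ARFF reader; names are illustrative only. */
final class ArffSketch {
    /** One data row: raw string values paired with attribute names by position. */
    static final class Row {
        final List<String> names;
        final String[] values;

        Row(List<String> names, String[] values) {
            this.names = names;
            this.values = values;
        }

        /** Looks up a value by attribute name. */
        String get(String name) {
            int i = names.indexOf(name);
            if (i < 0) {
                throw new IllegalArgumentException("Unknown attribute: " + name);
            }
            return values[i];
        }

        double getDouble(String name) {
            return Double.parseDouble(get(name));
        }
    }

    /** Parses header and data lines; attribute types are read past but not checked. */
    static List<Row> parse(String text) {
        List<String> names = new ArrayList<>();
        List<Row> rows = new ArrayList<>();
        boolean inData = false;
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("%")) {
                continue; // blank line or comment
            }
            String lower = line.toLowerCase();
            if (!inData && lower.startsWith("@attribute")) {
                // "@attribute <name> <type>": keep only the name in this sketch.
                names.add(line.split("\\s+")[1]);
            } else if (!inData && lower.startsWith("@data")) {
                inData = true;
            } else if (inData) {
                String[] values = line.split("\\s*,\\s*");
                if (values.length != names.size()) {
                    throw new IllegalArgumentException("Bad row: " + line);
                }
                rows.add(new Row(names, values));
            }
            // Other header lines such as @relation are skipped.
        }
        return rows;
    }

    public static void main(String[] args) {
        String text = "% toy example\n"
            + "@relation weather\n"
            + "@attribute temperature numeric\n"
            + "@attribute play {yes,no}\n"
            + "@data\n"
            + "85,no\n"
            + "70,yes\n";
        for (Row r : parse(text)) {
            System.out.println(r.getDouble("temperature") + " -> " + r.get("play"));
        }
    }
}
```

A writer would be the mirror image: emit one @attribute line per name, then
@data, then one comma-separated line per row.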
