Re: Contributing regression routines to Mahout

Ted Dunning Tue, 24 Aug 2010 19:30:18 -0700

On Tue, Aug 24, 2010 at 1:28 PM, David G. Boney <[email protected]> wrote:


>
> I would like to contribute to Mahout. Who would be the point person on the
> following topics: linear algebra routines, regression (real inputs, real
> outputs), subset selection for regression (lasso or ridge regression), and
> spectral graph methods?
>

Several of us can help on linear algebra.

We have no linear regression to speak of at this point.

I have done a fair bit of work on gradient descent with regularization.

We have the beginnings of a spectral clustering model (not the same as
general spectral graph methods), and we have an OK but not widely used
large-scale eigenvector decomposer.

> I am in the process of implementing a least squares linear regression
> algorithm, LSQR. In my quick review of Mahout, and I make no claims of
> digging into the code at this point, there appears to be extensive work in
> the area of classifiers, discrete outputs, but not regression, real output.
> I have an interest in building up a library of regression techniques (real
> inputs, real outputs).


As far as Mahout is concerned, scalability is the key question.  For many
regression problems, especially those with sparse inputs, gradient descent
is very effective, and the current SGD code for logistic regression could
probably be reused.  My guess is that for methods other than gradient
descent, the SVD implementations would be a better starting point.

I believe that Vowpal Wabbit can be used for linear regression, which
probably implies that Mahout's SGD solver could be as well, with small
changes to the gradient computation.
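
To make "small changes to the gradient computation" concrete, here is a
minimal sketch in Python. It is not Mahout code and does not reflect
Mahout's API; the names sgd_fit, identity, and logistic are made up for
illustration. It assumes squared loss for regression and log loss for
classification. Under that assumption the per-example gradient is
(link(w.x) - y) * x in both cases, so only the link function changes.

```python
import math
import random


def identity(z):
    """Link for least-squares regression: the prediction is the raw score."""
    return z


def logistic(z):
    """Link for logistic regression, written to avoid overflow in exp()."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_fit(examples, n_features, link, rate=0.05, l2=0.0, epochs=50, seed=0):
    """Plain SGD over sparse examples (hypothetical helper, not Mahout's).

    examples: list of (x, y) pairs, where x is a dict {feature_index: value}.
    link:     identity for squared loss, logistic for log loss.
    l2:       ridge-style penalty, applied only to the features present in
              the current example (a simplification that keeps updates sparse).
    """
    rng = random.Random(seed)
    w = [0.0] * n_features
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            score = sum(w[i] * v for i, v in x.items())
            err = link(score) - y  # the only loss-specific part
            for i, v in x.items():
                w[i] -= rate * (err * v + l2 * w[i])
    return w


if __name__ == "__main__":
    # Noise-free synthetic data: y = 2*a - 3*b + 0.5, with feature 2 as bias.
    rng = random.Random(42)
    points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
    train = [({0: a, 1: b, 2: 1.0}, 2 * a - 3 * b + 0.5) for a, b in points]
    print(sgd_fit(train, 3, identity))
```

Passing logistic instead of identity, with 0/1 targets, gives the
classification case, so the update loop itself is shared between the two.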


> I am also interested in the implementation of the numerical linear algebra
> routines, as these algorithms are at the crux of most regression problems.
>

We would love more help with this, especially in distributed cases.  Raw
numerical speed is rarely the bottleneck for Mahout code because large-scale
systems are typically I/O and network bound.  That said, much of our matrix
code is still as we inherited it, and while the quality is surprisingly high
for code without unit tests, I know from direct experience that there are
problems.  I am testing and correcting things as I need them, but you would
likely have a broader reach and thus might have substantially more impact
than I have had on the testing side.
