On Tue, Aug 24, 2010 at 1:28 PM, David G. Boney <[email protected]> wrote:

> I would like to contribute to Mahout. Who would be the point person on the
> following topics: linear algebra routines, regression (real inputs, real
> outputs), subset selection for regression (lasso or ridge regression), and
> spectral graph methods?
>

Several of us can help on linear algebra. We have no linear regression to speak of at this point. I have done a fair bit of work on gradient descent for regularization. We have the beginnings of a spectral clustering model (not the same as general spectral graph methods) and we have an OK, but not widely used, large-scale eigenvector decomposer.

> I am in the process of implementing a least squares linear regression
> algorithm, LSQR. In my quick review of Mahout, and I make no claims of
> digging into the code at this point, there appears to be extensive work in
> the area of classifiers, discrete outputs, but not regression, real output.
> I have an interest in building up a library of regression techniques (real
> inputs, real outputs).

As far as Mahout is concerned, scalability is the key question. For many regression problems, especially those with sparse inputs, gradient descent is very effective, and the current SGD for logistic regression could probably be leveraged. My guess is that for non-gradient descent methods, the SVD decompositions would be a better starting point.

I believe that Vowpal Wabbit can be used for linear regression, which probably implies that Mahout's SGD solver could be as well, with small changes to the gradient computation.

> I am also interested in the implementation of the numerical linear algebra
> routines, as these algorithms are at the crux of most regression problems.
>

We would love more help with this, especially in distributed cases. Raw numerical speed is rarely the bottleneck for Mahout code because large-scale systems are typically I/O and network bound.
That said, much of our matrix code is still as we inherited it, and while the quality is surprisingly high for code without unit tests, I know from direct experience that there are problems. I am testing and correcting things as I need them, but you would likely have a broader reach and thus might have substantially more impact than I have had on the testing side.
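
On the point above that an SGD solver for logistic regression could probably handle linear regression with small changes to the gradient computation, here is a minimal sketch in plain Python. It is not Mahout or Vowpal Wabbit code; the function names `predict` and `sgd_train` and the parameters `link`, `rate`, and `l2` are hypothetical, made up for illustration. It relies on the standard fact that squared loss with an identity link and log loss with a logistic link both give a per-example gradient of (prediction - target) * x, so only the prediction step has to change.

```python
import math
import random


def predict(weights, x, link):
    """Linear score passed through the link function.

    link="identity" gives linear regression (real-valued output);
    link="logistic" gives logistic regression (probability output).
    """
    score = sum(w * xi for w, xi in zip(weights, x))
    if link == "logistic":
        return 1.0 / (1.0 + math.exp(-score))
    return score


def sgd_train(examples, n_features, link="identity",
              rate=0.05, l2=0.0, epochs=500, seed=0):
    """Plain stochastic gradient descent over (features, target) pairs.

    The update is the same for both links; the only difference is how
    the prediction is computed. l2 is an optional ridge-style penalty.
    """
    rng = random.Random(seed)
    weights = [0.0] * n_features
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            error = y - predict(weights, x, link)
            for j, xj in enumerate(x):
                weights[j] += rate * (error * xj - l2 * weights[j])
    return weights


if __name__ == "__main__":
    # Toy data on the line y = 2*t + 1; the leading 1.0 is a bias feature.
    points = [([1.0, t], 2.0 * t + 1.0) for t in (0.0, 0.5, 1.0, 1.5, 2.0)]
    print(sgd_train(points, n_features=2, link="identity"))
```

Switching `link` to `"logistic"` and supplying 0/1 targets turns the same loop back into a logistic regression trainer. A real implementation would work on sparse vectors and use a learning-rate schedule, which this sketch leaves out.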
