I assume that the proper etiquette is to send all correspondence to the list, and that side conversations are discouraged.
I apologize in advance if I appear ignorant of all that Mahout has to offer. I am currently reading through the book Mahout in Action and looking at the code.

The regression algorithm (real inputs, real outputs) that I am working to implement is LSQR. It is an iterative algorithm based on the conjugate gradient method. It has the nice property that it only uses calculations of Ax and A'y, where A is a matrix, A' is its transpose, and x and y are vectors. I assume that with big data one wants to avoid computing the inverse of A, since that is O(n^3). Also, with sparse matrices one wants to avoid calculating AA', because the result can be dense. Below is a link to the algorithm:

http://www.stanford.edu/group/SOL/software/lsqr.html

One issue I am working on is calculating Ax and A'y with only one pass through A but perhaps multiple passes through x and y. In general, for regression problems A has m rows and n columns, where m >> n. The vector x has dimension n and the vector y has dimension m. It might be the case that x will fit in memory, but I think the design should not assume this. I also want to look at the cases where A is dense and where A is sparse.

LSQR can be used for maximum likelihood estimation of the parameters of models that are linear in their coefficients. I will first work on linear regression and then investigate the issue of using other basis sets. There may already be work in Mahout to draw on for this issue. I think this will keep me busy for a while. Future issues would include:

1. subset selection using the lasso or ridge regression
2. simple Bayesian regression, for which there may already be work to draw on in Mahout
3.
special basis sets

My reference for least squares is:

Bjorck, Ake (1996), Numerical Methods for Least Squares Problems
http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=numerical+least+squares&x=0&y=0&ih=12_7_1_1_0_1_0_0_1_1.96_104&fsc=-1

My reference for numerical linear algebra is:

Saad, Yousef (2003), Iterative Methods for Sparse Linear Systems
http://www-users.cs.umn.edu/~saad/books.html

-------------
Sincerely,
David G. Boney
[email protected]
http://www.sonicartifacts.com

On Aug 24, 2010, at 9:29 PM, Ted Dunning wrote:

> On Tue, Aug 24, 2010 at 1:28 PM, David G. Boney
> <[email protected]> wrote:
>
>> I would like to contribute to Mahout. Who would be the point person on the
>> following topics: linear algebra routines, regression (real inputs, real
>> outputs), subset selection for regression (lasso or ridge regression), and
>> spectral graph methods?
>
> Several of us can help on linear algebra.
>
> We have no linear regression to speak of at this point.
>
> I have done a fair bit of work on gradient descent for regularization.
>
> We have the beginnings of a spectral clustering model (not the same as
> general spectral graph methods) and we have an OK, but not widely used,
> large-scale eigenvector decomposer.
>
>> I am in the process of implementing a least squares linear regression
>> algorithm, LSQR. In my quick review of Mahout, and I make no claims of
>> digging into the code at this point, there appears to be extensive work in
>> the area of classifiers (discrete outputs) but not regression (real outputs).
>> I have an interest in building up a library of regression techniques (real
>> inputs, real outputs).
>
> As far as Mahout is concerned, scalability is the key question. For many
> regression problems, especially those with sparse inputs, gradient descent
> is very effective and the current SGD for logistic regression could probably
> be leveraged.
> My guess is that for non-gradient descent methods, the SVD
> decompositions would be a better starting point.
>
> I believe that Vowpal Wabbit can be used for linear regression, which
> probably implies that Mahout's SGD solver could be as well with small
> changes to the gradient computation.
>
>> I am also interested in the implementation of the numerical linear algebra
>> routines, as these algorithms are at the crux of most regression problems.
>
> We would love more help with this, especially in distributed cases. Raw
> numerical speed is rarely the bottleneck for Mahout code because large-scale
> systems are typically I/O and network bound. That said, much of our matrix
> code is still as we inherited it, and while the quality is surprisingly high
> for code without unit tests, I know from direct experience that there are
> problems. I am testing and correcting things as I need them, but you would
> likely have a broader reach and thus might have substantially more impact
> than I have had on the testing side.
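P.S. To make the one-pass idea above concrete: here is a rough sketch in plain Java of computing both Ax and A'y while streaming through the rows of A exactly once. This is not Mahout code; the class name `OnePassProducts` and the use of plain `double[]` vectors are my own simplifications, and the sketch assumes both x and y fit in memory (which, as noted above, a real design should not assume).

```java
import java.util.Iterator;

/**
 * One-pass computation of both A*x and A'*y. Each row a_i of A is
 * visited exactly once: it contributes one dot product to (A*x)_i,
 * and the scaled row y_i * a_i to the running sum for A'*y. A real
 * implementation would use Mahout's Vector types and sparse rows.
 */
public class OnePassProducts {

    /** Holds both products from a single sweep over the rows of A. */
    public static class Result {
        public final double[] ax;  // A * x, length m
        public final double[] aty; // A' * y, length n
        Result(double[] ax, double[] aty) { this.ax = ax; this.aty = aty; }
    }

    /**
     * rows: the m rows of A (each of length n), streamed once.
     * x has dimension n; y has dimension m.
     */
    public static Result compute(Iterator<double[]> rows, double[] x, double[] y) {
        int n = x.length;
        int m = y.length;
        double[] ax = new double[m];
        double[] aty = new double[n];
        int i = 0;
        while (rows.hasNext()) {
            double[] row = rows.next();
            double dot = 0.0;
            for (int j = 0; j < n; j++) {
                dot += row[j] * x[j];    // accumulate (A*x)_i
                aty[j] += y[i] * row[j]; // accumulate y_i * a_i into A'*y
            }
            ax[i++] = dot;
        }
        return new Result(ax, aty);
    }
}
```

The point of the structure is that A, the large object, is read only once; x and y, the small vectors, are touched once per row, which is where any paging of x or y would go if they did not fit in memory.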
