I am going to stick with LSQR. Part of this is an exercise to learn how to use Hadoop and Mahout. I also consulted with Dianne O'Leary at the University of Maryland, a leading numerical linear algebra researcher (and a former professor of mine). She highly recommended the algorithm. I want to start with something well known and reliable.
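To make the interface concrete, here is a minimal sketch, assuming SciPy's lsqr and NumPy purely as stand-ins for an eventual Mahout implementation: LSQR only ever touches A through the products Ax and A'y, so A can sit behind a matrix-free operator whose two callbacks would each be one pass over the rows of A. The sizes and names below are illustrative.

import numpy as np
from scipy.sparse.linalg import LinearOperator, lsqr

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 50))   # stand-in for a large m x n matrix, m >> n
b = rng.standard_normal(1000)

# Expose A only through matvec (Ax) and rmatvec (A'y); a distributed
# implementation would back these two callbacks with passes over the rows of A.
op = LinearOperator(shape=A.shape,
                    matvec=lambda x: A @ x,     # computes Ax
                    rmatvec=lambda y: A.T @ y)  # computes A'y

x = lsqr(op, b)[0]
print(np.linalg.norm(A @ x - b))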
I am very interested in making additional contributions to the numerical linear algebra system. I will gladly review the method you suggested for future projects. Right now I am reading up on iterative methods in:

Henk A. van der Vorst, (2009), Iterative Krylov Methods for Large Linear Systems.
http://www.amazon.com/Iterative-Cambridge-Monographs-Computational-Mathematics/dp/0521183707/ref=sr_1_3?ie=UTF8&s=books&qid=1283044114&sr=8-3

-------------
Sincerely,
David G. Boney
[email protected]
http://www.sonicartifacts.com

On Aug 28, 2010, at 7:45 PM, Ted Dunning wrote:

> LSQR is essentially a Krylov subspace technique since it is based on the Lanczos algorithm. As such, it is highly iterative and requires many passes through the data.
>
> It might be more efficient if you could use a random projection technique. These algorithms can generally do the work of many Krylov iterations in a single pass over the data. I haven't looked into this in detail, but it seems to me that a single pass could get you the least-squares fit for the first k < 100 singular values and that subsequent passes could be used on the error to get the effect of 100 singular values per iteration.
>
> The crux of the idea is that if you right-multiply your large matrix A by a tall skinny random matrix R, you should be able to generate an orthonormal matrix Q such that Q Y = A R. The exciting part is that Q Q' A will be a close approximation of A in terms of Frobenius norm. Minimizing || Q Q' A x - b ||_F should be possible fairly quickly without actually constructing Q Q' A because (I think) you should be able to minimize || Q' A x - Q' b ||_F instead. Moreover, since Q' A has limited rank, I think you can use (Q' A)' (Q' A) to find the answer you want without the horrible numerical problems that approach usually entails.
>
> Less directly, you can form the partial SVD and use that to solve for x.
>
> Solving the L_2 regularized system would proceed essentially the same way with a fancier value for A. I don't know how this would extend to solving the L_1 regularized system.
>
> Since just reading A is the dominant cost, and since you get something proportional to k singular vectors in a single determination of Q versus something proportional to 1 with each Krylov iteration, this could get you two or three orders of magnitude improvement in speed.
>
> As usual, I would refer you to Halko, Martinsson, and Tropp's great survey article at http://arxiv.org/abs/0909.4061v1
>
> They refer to some approaches for least squares solutions, but I haven't looked at the references that they cite.
>
>
> On Tue, Aug 24, 2010 at 10:21 PM, David G. Boney <[email protected]> wrote:
>
>> I assume that the proper etiquette is to send all correspondence to the list, and that side conversations are discouraged.
>>
>> I apologize in advance if I appear ignorant of all that Mahout has to offer. I am currently reading through the book Mahout in Action and looking at the code.
>>
>> The regression algorithm (real inputs, real outputs) that I am working to implement is LSQR. It is an iterative algorithm based on the conjugate gradient method. It has a nice property that it only uses calculations of Ax and A'y, where A is a matrix, A' is the transpose, and x & y are vectors. I would assume that with big data, one wants to avoid calculating A inverse since this is O(n^3).
>> Also, with sparse matrices, one wants to avoid calculating AA' because the result can be dense. Below is a link to the algorithm.
>>
>> http://www.stanford.edu/group/SOL/software/lsqr.html
>>
>> One issue I am working on is calculating Ax and A'y with only one pass through A but perhaps multiple passes through x and y. In general, for regression problems A is m rows and n columns where m >> n. The vector x has the dimension n and the vector y has the dimension m. It might be the case that x will fit in memory, but I think the design should not assume this. I also want to look at the issues where A is dense and where A is sparse.
>>
>> LSQR can be used for maximum likelihood estimation of parameters of models that are linear in their coefficients. I will first work on linear regression and then investigate the issue of using other basis sets. There may already be work in Mahout to draw on for this issue.
>>
>> I think this will keep me busy for a while.
>>
>> Future issues would include:
>> 1. subset selection using the lasso or ridge regression
>> 2. simple Bayesian regression, for which there may already be work to draw on in Mahout
>> 3. special basis sets
>>
>> My reference for least squares is:
>> Bjorck, Ake, (1996), Numerical Methods for Least Squares Problems
>> http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=numerical+least+squares&x=0&y=0&ih=12_7_1_1_0_1_0_0_1_1.96_104&fsc=-1
>>
>> My reference for numerical linear algebra is:
>> Saad, Yousef, (2003), Iterative Methods for Sparse Linear Systems
>> http://www-users.cs.umn.edu/~saad/books.html
>>
>> -------------
>> Sincerely,
>> David G. Boney
>> [email protected]
>> http://www.sonicartifacts.com
>>
>>
>> On Aug 24, 2010, at 9:29 PM, Ted Dunning wrote:
>>
>>> On Tue, Aug 24, 2010 at 1:28 PM, David G. Boney <[email protected]> wrote:
>>>
>>>> I would like to contribute to Mahout. Who would be the point person on the following topics: linear algebra routines, regression (real inputs, real outputs), subset selection for regression (lasso or ridge regression), and spectral graph methods?
>>>
>>> Several of us can help on linear algebra.
>>>
>>> We have no linear regression to speak of at this point.
>>>
>>> I have done a fair bit of work on gradient descent for regularization.
>>>
>>> We have the beginnings of a spectral clustering model (not the same as general spectral graph methods) and we have an OK, but not widely used, large-scale eigenvector decomposer.
>>>
>>>> I am in the process of implementing a least squares linear regression algorithm, LSQR. In my quick review of Mahout, and I make no claims of digging into the code at this point, there appears to be extensive work in the area of classifiers, discrete outputs, but not regression, real output. I have an interest in building up a library of regression techniques (real inputs, real outputs).
>>>
>>> As far as Mahout is concerned, scalability is the key question. For many regression problems, especially those with sparse inputs, gradient descent is very effective and the current SGD for logistic regression could probably be leveraged. My guess is that for non-gradient descent methods, the SVD decompositions would be a better starting point.
>>>
>>> I believe that Vowpal Wabbit can be used for linear regression, which probably implies that Mahout's SGD solver could be as well with small changes to the gradient computation.
>>>
>>>> I am also interested in the implementation of the numerical linear algebra routines, as these algorithms are at the crux of most regression problems.
>>>
>>> We would love more help with this, especially in distributed cases. Raw numerical speed is rarely the bottleneck for Mahout code because large-scale systems are typically I/O and network bound. That said, much of our matrix code is still as we inherited it, and while the quality is surprisingly high for code without unit tests, I know from direct experience that there are problems. I am testing and correcting things as I need them, but you would likely have a broader reach and thus might have substantially more impact than I have had on the testing side.
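For reference, a rough NumPy sketch of the random projection idea from the quoted message above. The sizes, the choice of k, and the use of lstsq on the small projected system are illustrative assumptions, not anything settled in this thread; it only shows the shape of the computation: sample the range of A with a random R, orthonormalize, then minimize || Q'A x - Q'b || instead of working with Q Q' A directly.

import numpy as np

rng = np.random.default_rng(0)
m, n, k = 5000, 300, 100                 # m >> n; k = number of directions to capture
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

R = rng.standard_normal((n, k))          # tall skinny random matrix
Y = A @ R                                # one pass over A
Q, _ = np.linalg.qr(Y)                   # orthonormal Q with Q Q' A ~ A

# Solve the small projected problem min || Q'A x - Q'b || rather than
# constructing Q Q' A explicitly; forming Q'A needs a second pass over A.
QtA = Q.T @ A
Qtb = Q.T @ b
x, *_ = np.linalg.lstsq(QtA, Qtb, rcond=None)

print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))  # residual of the approximate fit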

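And a small sketch, again plain NumPy rather than Hadoop code, of the one-pass question raised in the quoted thread: when A is streamed row by row, row a_i contributes a_i . x to (Ax)_i and y_i * a_i to A'y, so both products fall out of a single pass over A. The function and variable names are made up for illustration.

import numpy as np

def single_pass_products(rows, x, y):
    """Stream the rows of A once and return (A x, A' y)."""
    Ax = []
    Aty = np.zeros_like(x, dtype=float)
    for i, a_i in enumerate(rows):     # one pass over the rows of A
        Ax.append(a_i @ x)             # (A x)_i = a_i . x
        Aty += y[i] * a_i              # accumulate A' y
    return np.array(Ax), Aty

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20))
x = rng.standard_normal(20)
y = rng.standard_normal(1000)

Ax, Aty = single_pass_products(iter(A), x, y)
assert np.allclose(Ax, A @ x) and np.allclose(Aty, A.T @ y)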