On Thu, May 6, 2010 at 2:18 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> The only issue with the current distributed Lanczos solver is storage for
> the auxiliary matrices as they are produced. Jake intimated that he had a
> solution for that that wasn't prime-time yet.
>
Measuring exactly how much this can affect you is easy, as well: the eigenvectors which currently live in memory on the "driver" machine (where you launch your Hadoop job from) take up k*M*8 bytes, where k is the number of eigenvectors you're aiming for and M is the dimensionality of the rows of your input matrix. Currently you may actually need twice that value, because the basis which is built up is transformed once during an orthonormalization step. So if you want 250 singular vectors, and your input matrix is comprised of twenty-seven gazillion input vectors, each of dimension 10 million, then you would need between 20 and 40GB of RAM on the driver box (this value being independent of what "a gazillion" is, of course). If there are only 1 million features per row, instead of 10 million, this drops down to 2-4GB.

The solution I have which isn't ready for prime time is very simple, and could be done by really anyone who has the time (meaning you don't need to know much fancy linear algebra or SVD internals); I don't have that time right now, as I'm traveling abroad and dealing with a million "real life" issues at present. If you look in LanczosSolver.solve(), the SparseRowMatrix called "basis" is used, as one might imagine, to store the set of basis vectors which the Lanczos iteration is generating. Instead of keeping that set in memory, on line 136 each new vector could be persisted to disk (either local, or to HDFS [as a DistributedRowMatrix], or to a DB if reliability is a concern).

The only catch is that the basis matrix is currently used inside that loop, on line 126:

  orthoganalizeAgainstAllButLast(nextVector, basis);

Technically this is overkill, since Lanczos should already have made sure that nextVector is orthogonal to all of the previous vectors in the basis, but I don't completely trust that all overflow/underflow issues with the finite precision of double values, when applied to matrix multiplication with enormous matrices, have been dealt with completely. Ideally LanczosSolver.solve() could take a "do it safely / do it fast" flag and either keep that line or skip it. In the "do it safely" case, the slow way is to use the disk-backed matrix to do this step either locally (one vector at a time) or via another small Hadoop job (which should actually just be: nextVector.assign(basis.timesSquared(nextVector), Functions.minus) ); a rough sketch is below.

Similarly, the loop at lines 157-172 just builds out the actual eigenvectors from the basis and the (small) eigenvectors of the tri-diagonal auxiliary matrix. This could be done by a small Hadoop job, or simply done without holding all of the eigenvectors in memory at the same time, since none of them requires any of the others to be computed.
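Just to make the "safe" path concrete, here is a rough, untested sketch of what the re-orthogonalization could look like once the basis is persisted as a DistributedRowMatrix. The imports/package names are approximate and the wrapper method name is just for illustration; the only calls that matter are timesSquared() and assign() with Functions.minus, mentioned above.

  import org.apache.mahout.math.Vector;
  import org.apache.mahout.math.function.Functions;
  import org.apache.mahout.math.decomposer.hadoop.DistributedRowMatrix;

  /**
   * Sketch only, not tested: re-orthogonalize nextVector against a
   * disk/HDFS-backed basis.  basis.timesSquared(v) computes
   * basis^T * (basis * v), i.e. the sum over all basis rows b_i of
   * (b_i dot v) * b_i, as a small Hadoop job; when the rows are
   * orthonormal this is the projection of v onto their span, so
   * subtracting it strips any residual components left by round-off.
   */
  static Vector orthogonalizeAgainstBasis(DistributedRowMatrix basis,
                                          Vector nextVector) {
    nextVector.assign(basis.timesSquared(nextVector), Functions.minus);
    return nextVector;
  }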
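And, similarly hedged, a sketch of the lines 157-172 rebuild done one eigenvector at a time. Everything here other than the Vector math is a placeholder: how the basis rows are streamed back in, and where the finished eigenvectors get written, is whatever storage you picked above, not existing Mahout API.

  import java.util.List;
  import org.apache.mahout.math.DenseVector;
  import org.apache.mahout.math.Vector;

  /**
   * Sketch only: build each full eigenvector as the linear combination of
   * the persisted basis rows, weighted by the entries of the corresponding
   * (small) eigenvector of the tri-diagonal matrix, then persist it and
   * drop it, so only the eigenvector currently being built is held in memory.
   */
  static void rebuildEigenVectors(List<Vector> smallEigens,    // eigenvectors of the tri-diagonal matrix
                                  Iterable<Vector> basisRows,  // persisted Lanczos basis, streamed in order
                                  int numFeatures) {           // M, the row dimensionality
    for (int j = 0; j < smallEigens.size(); j++) {
      Vector smallEigen = smallEigens.get(j);
      Vector realEigen = new DenseVector(numFeatures);
      int i = 0;
      for (Vector basisRow : basisRows) {
        // realEigen += smallEigen[i] * basisRow_i
        realEigen = realEigen.plus(basisRow.times(smallEigen.get(i++)));
      }
      realEigen = realEigen.normalize();
      // ... persist realEigen (to HDFS, local disk, or a DB) before moving on,
      // rather than collecting all of the eigenvectors in memory as the
      // current loop does.
    }
  }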
If someone wants to submit a patch doing some or all of this, I'd be happy to review and commit it. Otherwise I'll get it done sometime in the next couple of months, but possibly not before mid-July or so.

  -jake

> On Thu, May 6, 2010 at 12:20 PM, Jake Mannix <jake.man...@gmail.com> wrote:
>
> > Tamas,
> >
> > MAHOUT-371 will be able to leverage the existing DistributedLanczosSolver
> > and DistributedRowMatrix (in the o.a.m.math.decomposer.hadoop package in
> > core) to do full sparse truncated SVD on the entire user-item matrix
> > already, so that part is taken care of.
> >
> >   -jake
> >
> > On Thu, May 6, 2010 at 11:38 AM, Tamas Jambor <jambo...@googlemail.com> wrote:
> >
> > > That looks interesting, but quite general. I'd be interested to know how
> > > he plans to divide the task that will be distributed. I mean, SVD in
> > > general takes the whole user-item matrix, so it will be challenging to
> > > find a good way to divide the task. Papers written on SVD do not discuss
> > > this aspect, as far as I know.
> > >
> > > On 06/05/2010 18:32, Sean Owen wrote:
> > >
> > >> We're lucky to have a GSoC student implementing this over the summer:
> > >> https://issues.apache.org/jira/browse/MAHOUT-371
> > >>
> > >> On Thu, May 6, 2010 at 6:28 PM, Tamas Jambor <jambo...@googlemail.com>
> > >> wrote:
> > >>
> > >>> I am looking into the problem of distributed SVD for recommender
> > >>> systems. Does anyone know whether someone else tried to tackle this
> > >>> problem before?