On Thu, May 6, 2010 at 2:18 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> The only issue with the current distributed Lanczos solver is storage for
> the auxiliary matrices as they are produced. Jake intimated that he had a
> solution for that that wasn't prime-time yet.
>
Measuring exactly how much this can affect you is easy, as well: the eigenvectors which currently live in memory on the "driver" machine (where you launch your Hadoop job from) take up k*M*8 bytes, where k is the number of eigenvectors you're aiming for and M is the dimensionality of the rows of your input matrix. Currently you may actually need twice that value, because the basis which is built up is transformed once during an orthonormalization step. So if you want 250 singular vectors, and your input matrix is comprised of twenty-seven gazillion input vectors, each of dimension 10 million, then you would need between 20 and 40GB of RAM on the driver box (this value being independent of what "a gazillion" is, of course). If there are only 1 million features per row, instead of 10 million, this drops down to 2-4GB.

The solution I have which isn't ready for prime time is very simple, and could be done by really anyone who has the time (meaning you don't need to know much fancy linear algebra or SVD internals); I don't have that time right now, as I'm traveling abroad and dealing with a million "real life" issues at present. If you look in LanczosSolver.solve(), the SparseRowMatrix called "basis" is used, as one might imagine, to store the set of basis vectors which the Lanczos iteration is generating. Instead of keeping that set in memory, on line 136 each new vector could be persisted to disk (either local, or to HDFS [as a DistributedRowMatrix], or to a DB if reliability is a concern).

The only catch is that the basis matrix is currently used inside that loop, on line 126:

  orthoganalizeAgainstAllButLast(nextVector, basis);

Technically this is overkill, since Lanczos should already have made sure that nextVector is orthogonal to all of the previous vectors in the basis, but I don't completely trust that all overflow/underflow issues with the finite precision of double values, when applied to matrix multiplication with enormous matrices, have been dealt with completely. Ideally LanczosSolver.solve() could take a "do it safely / do it fast" flag and either keep that line or skip it. In the "do it safely" case, the slow way is to use the disk-backed matrix to do this step either locally (one vector at a time) or via another small Hadoop job (which should actually just be: nextVector.assign(basis.timesSquared(nextVector), Functions.minus) ); a rough sketch is below.

Similarly, the loop at lines 157-172 just builds out the actual eigenvectors from the basis and the (small) eigenvectors of the tri-diagonal auxiliary matrix. This could be done by a small Hadoop job, or simply done without holding all of the eigenvectors in memory at the same time, since none of them requires any of the others to be computed.
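Just to make the "safe" path concrete, here is a rough, untested sketch of what the re-orthogonalization could look like once the basis is persisted as a DistributedRowMatrix. The imports/package names are approximate and the wrapper method name is just for illustration; the only calls that matter are timesSquared() and assign() with Functions.minus, mentioned above.

  import org.apache.mahout.math.Vector;
  import org.apache.mahout.math.function.Functions;
  import org.apache.mahout.math.decomposer.hadoop.DistributedRowMatrix;

  /**
   * Sketch only, not tested: re-orthogonalize nextVector against a
   * disk/HDFS-backed basis.  basis.timesSquared(v) computes
   * basis^T * (basis * v), i.e. the sum over all basis rows b_i of
   * (b_i dot v) * b_i, as a small Hadoop job; when the rows are
   * orthonormal this is the projection of v onto their span, so
   * subtracting it strips any residual components left by round-off.
   */
  static Vector orthogonalizeAgainstBasis(DistributedRowMatrix basis,
                                          Vector nextVector) {
    nextVector.assign(basis.timesSquared(nextVector), Functions.minus);
    return nextVector;
  }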
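And, similarly hedged, a sketch of the lines 157-172 rebuild done one eigenvector at a time. Everything here other than the Vector math is a placeholder: how the basis rows are streamed back in, and where the finished eigenvectors get written, is whatever storage you picked above, not existing Mahout API.

  import java.util.List;
  import org.apache.mahout.math.DenseVector;
  import org.apache.mahout.math.Vector;

  /**
   * Sketch only: build each full eigenvector as the linear combination of
   * the persisted basis rows, weighted by the entries of the corresponding
   * (small) eigenvector of the tri-diagonal matrix, then persist it and
   * drop it, so only the eigenvector currently being built is held in memory.
   */
  static void rebuildEigenVectors(List<Vector> smallEigens,    // eigenvectors of the tri-diagonal matrix
                                  Iterable<Vector> basisRows,  // persisted Lanczos basis, streamed in order
                                  int numFeatures) {           // M, the row dimensionality
    for (int j = 0; j < smallEigens.size(); j++) {
      Vector smallEigen = smallEigens.get(j);
      Vector realEigen = new DenseVector(numFeatures);
      int i = 0;
      for (Vector basisRow : basisRows) {
        // realEigen += smallEigen[i] * basisRow_i
        realEigen = realEigen.plus(basisRow.times(smallEigen.get(i++)));
      }
      realEigen = realEigen.normalize();
      // ... persist realEigen (to HDFS, local disk, or a DB) before moving on,
      // rather than collecting all of the eigenvectors in memory as the
      // current loop does.
    }
  }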
If someone wants to submit a patch doing some or all of this, I'd be happy to review and commit it. Otherwise I'll get it done sometime in the next couple of months, but possibly not before mid-July or so.

  -jake

> On Thu, May 6, 2010 at 12:20 PM, Jake Mannix <jake.man...@gmail.com> wrote:
>
> > Tamas,
> >
> > MAHOUT-371 will be able to leverage the existing DistributedLanczosSolver
> > and DistributedRowMatrix (in the o.a.m.math.decomposer.hadoop package in
> > core) to do full sparse truncated SVD on the entire user-item matrix
> > already, so that part is taken care of.
> >
> >   -jake
> >
> > On Thu, May 6, 2010 at 11:38 AM, Tamas Jambor <jambo...@googlemail.com> wrote:
> >
> > > That looks interesting, but quite general. I'd be interested to know how
> > > he plans to divide the task that will be distributed. I mean, SVD in
> > > general takes the whole user-item matrix, so it will be challenging to
> > > find a good way to divide the task. Papers written on SVD do not discuss
> > > this aspect, as far as I know.
> > >
> > > On 06/05/2010 18:32, Sean Owen wrote:
> > >
> > >> We're lucky to have a GSoC student implementing this over the summer:
> > >> https://issues.apache.org/jira/browse/MAHOUT-371
> > >>
> > >> On Thu, May 6, 2010 at 6:28 PM, Tamas Jambor <jambo...@googlemail.com>
> > >> wrote:
> > >>
> > >>> I am looking into the problem of distributed SVD for recommender
> > >>> systems. Does anyone know whether someone else tried to tackle this
> > >>> problem before?