On Wed, Oct 13, 2010 at 10:36 AM, Dmitriy Lyubimov <[email protected]> wrote:

> I see. Very interesting.
>
> the only problem (well, something I perceive as a problem) is not even
> that B got inflated so much, but rather that the reduced SVD problem is
> not a (k+p)x(k+p) problem anymore. There are two things here:
>
> -- the user doesn't really set the actual precision anymore (k+p was
> supposed to be the lever);
>

They still set k+p.  Q still has the same number of non-zero elements and
convergence still works the same as it did before.
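
For concreteness, here is a minimal single-machine numpy sketch of the projection step we are both describing, with k+p as the precision lever (all names, shapes, and numbers are made up for illustration; this is not the Mahout code):

```python
import numpy as np

rng = np.random.default_rng(42)

# A low-rank test matrix (shapes assumed purely for illustration).
m, n, true_rank = 500, 300, 10
A = rng.standard_normal((m, true_rank)) @ rng.standard_normal((true_rank, n))

k, p = 10, 5                        # k+p is the precision lever
Omega = rng.standard_normal((n, k + p))
Y = A @ Omega                       # m x (k+p) sample of the range of A
Q, _ = np.linalg.qr(Y)              # orthonormal basis, still k+p columns
B = Q.T @ A                         # (k+p) x n projected problem
Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
U = Q @ Ub                          # lift back to the original row space

err = np.linalg.norm(A - (U[:, :k] * s[:k]) @ Vt[:k]) / np.linalg.norm(A)
```

For an exactly rank-k input like this one, err comes out at machine precision; k+p only controls how well Q captures the range of A.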


> -- the reduced SVD problem's dimensions are now ~m. Initially I thought
> the philosophy behind this was that we want to take a streaming problem
> of m x n size and reduce it to a problem that doesn't depend on m or n;
> memory-wise, n is only constrained by our memory settings on the mapper.
> In reality, under these circumstances, m can easily be 1E6 (an 8 MB dense
> vector) or more (default hadoop mapper setting -Xmx200m); m is not bound
> at all by memory constraints (i.e. streaming goes along m). So in the
> example I thought of trying, m x n can be 1E9x1E6, i.e. a petabyte-scale
> problem (sort of an SVD version of a Terasort benchmark). But I guess if
> the BBt dimensions are now ~s(k+p), where s~m, then that is not true
> anymore and m is theoretically bounded as well (whether it is a practical
> issue or not is not my point; most likely it is not).


It isn't (quite) that bad.

The expansion of the number of rows of B is actually proportional to the
number of blocks in A.  This is proportional to the size of A and will set a
limit, but hopefully a large enough limit to be very useful.

B is also denser than A, so B will be harder to store than the same number
of rows of A.  The densification is proportional to the number of blocks
of A.
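
A quick back-of-envelope version of that limit, with entirely made-up numbers, just to see its shape:

```python
# All numbers here are assumptions for illustration, not measurements.
m, n = 10**6, 10**5      # A is m x n
k, p = 100, 20           # decomposition rank plus oversampling
h = 30_000               # rows of A per block (depends on the split size)

s = -(-m // h)           # number of blocks of A (ceiling division)
rows_B = s * (k + p)     # each block contributes k+p rows to B
print(s, rows_B)         # 34 blocks -> 4080 rows of B
```

So the row count of B does grow with m, but only through the block count s, which is what sets the limit.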

In general, however, I think that there are strategies for dealing with B
out of core even without computing B B^T.  I haven't worked out the details
yet, but one thought is that since we are after the left singular vectors
of B, it might be possible to do a row-wise streaming decomposition of B
that would give us a second Q that we could decompose easily.  If we can
construct Q without having to have all of B in memory at once, we could
win that way.
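
One possible reading of that idea (my own sketch, not a worked-out proposal): stream over B one column block at a time, i.e. row blocks of B^T, maintaining only the small triangular factor R of B^T = QR; the left singular vectors of B then fall out of a small dense problem.  Shapes below are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the inflated B; shapes are assumptions for illustration.
rows_B, n = 40, 1000
B = rng.standard_normal((rows_B, n))

# Stream over row blocks of B^T (column blocks of B), keeping only the
# small triangular factor R of B^T = Q3 R; Q3 is never materialized.
R = np.zeros((0, rows_B))
for j in range(0, n, 100):
    block = B[:, j:j + 100].T                 # one row block of B^T
    R = np.linalg.qr(np.vstack([R, block]), mode='r')

# B = R^T Q3^T with Q3 orthonormal, so the left singular vectors of B are
# exactly the left singular vectors of the small rows_B x rows_B matrix R^T.
U, s, _ = np.linalg.svd(R.T)

# Sanity check: singular values match a direct SVD of B.
s_ref = np.linalg.svd(B, compute_uv=False)
```

The point is that only an rows_B x rows_B factor is ever held in memory, however wide B is.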


>
> This kind of shifts the weight of computation from the MR side to what I
> think is a single-threaded eigensolver. I would like to spend just a tad
> more time poking around to see if there's still a way to make MR work a
> little bit harder.
>

I agree.  It might also be possible to do the whole random projection trick
on the left side this time to get a cheap decomposition of B.
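
One way that could look (my reading, with assumed shapes, not the actual implementation): since we only need the column space of B, a random probe already pins down its left singular vectors, the same trick applied once more:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for B with low effective rank (all shapes are assumptions).
rows_B, n, k, p = 200, 1000, 10, 5
B = rng.standard_normal((rows_B, k)) @ rng.standard_normal((k, n))

# Random probe capturing the column space of B, which is spanned by the
# left singular vectors we are after.
Omega = rng.standard_normal((n, k + p))
Q2, _ = np.linalg.qr(B @ Omega)            # rows_B x (k+p) basis
C = Q2.T @ B                               # small (k+p) x n problem
Uc, s, Vt = np.linalg.svd(C, full_matrices=False)
U = Q2 @ Uc                                # approximate left singular vectors of B

err = np.linalg.norm(B - (U * s) @ Vt) / np.linalg.norm(B)
```

For a B whose effective rank is within k+p, the approximation is essentially exact, and the dense solve is only (k+p) x n.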
