On Wed, Oct 13, 2010 at 10:36 AM, Dmitriy Lyubimov <[email protected]> wrote:
> I see. Very interesting.
>
> The only problem (well, something I perceive as a problem) is not even
> that B got inflated so much, but rather that the reduced SVD problem is
> no longer a (k+p) x (k+p) problem. There are two things here:
>
> -- The user doesn't really set the actual precision anymore (k+p was
> supposed to be the lever);

They still set k+p. Q still has the same number of non-zero elements and
convergence still works the same as it did before.

> -- The reduced SVD problem dimensions are now ~m. Initially I thought the
> philosophy was that we want to take a streaming problem of m x n size and
> reduce it to a problem that doesn't depend on m or n; memory-wise, n is
> only constrained by our memory settings on the mapper. In reality, under
> these circumstances, m can easily be 1E6 (an 8 MB dense vector) or more
> (default hadoop mapper setting -Xmx200m). m is not bounded at all by
> memory constraints (i.e. streaming goes along m). So in the example I
> thought of trying, m x n can be 1E9 x 1E6, i.e. a petabyte-scale problem
> (sort of an SVD version of a Terasort benchmark). But I guess if the BBt
> dimensions are now ~s(k+p), where s~m, then that is not true anymore and
> m is theoretically bounded as well (whether it is a practical issue or
> not is not my point; most likely it is not).

It isn't that bad (quite). The expansion of the number of rows of B is
actually proportional to the number of blocks in A. This is proportional to
the size of A and will set a limit, but hopefully a large enough limit to
be very useful.

B is also denser than A, so B will be harder to store than the same number
of rows of A. The densification is proportional to the number of blocks of A.

In general, however, I think that there are strategies for dealing with B
out of core even without computing B B^T.
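For readers following along, here is a minimal NumPy sketch of the single-block stochastic SVD flow this thread assumes (Halko et al. style); the variable names are illustrative, not Mahout's actual API. It shows where the k+p lever enters and why, in the unblocked case, the reduced problem B = Q^T A has only k+p rows:

```python
import numpy as np

# Minimal sketch of single-block stochastic SVD (illustrative names, not
# Mahout's API). k+p is the user's precision lever discussed above.
m, n, k, p = 2000, 300, 10, 5
rng = np.random.default_rng(0)

# Make A exactly rank k so the approximation quality is easy to check.
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))

Omega = rng.standard_normal((n, k + p))   # random test matrix
Y = A @ Omega                             # m x (k+p) sample of A's column space
Q, _ = np.linalg.qr(Y)                    # orthonormal basis, m x (k+p)

B = Q.T @ A                               # (k+p) x n -- the small reduced problem
Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
U = Q @ Ub                                # approximate left singular vectors of A

err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
```

In the blocked variant under discussion, each block of A contributes its own block of rows to B, so B's row count becomes ~s(k+p) instead of k+p, which is what inflates the reduced problem.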
I haven't worked out the details yet, but one thought is that since we are
after the left singular vectors of B, it might be possible to do a row-wise
streaming decomposition of B that would give us a second Q that we could
decompose easily. If we can construct that Q without having to have all of
B in memory at once, we could win that way.

> This kind of shifts the weight of computation from the MR side to what I
> think is a single-threaded eigensolver. I would like to spend just a tad
> more time poking around to see if there's still a way to make MR work a
> little bit harder.

I agree. It might also be possible to do the whole random projection trick
on the left side this time to get a cheap decomposition of B.
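To make that last idea concrete, here is a hedged NumPy sketch (hypothetical names and sizes, not a worked-out design) of applying the same random projection trick to B itself: the product B @ Omega2 needs only one row of B at a time, the tall-skinny QR is the kind of thing TSQR/streaming QR handles, and the resulting reduced problem Q2^T B is back down to ~(k+p) rows:

```python
import numpy as np

# Hypothetical sketch: repeat the random projection trick on B to get a cheap
# decomposition, as suggested above. B stands in for the inflated s*(k+p) x n
# matrix; the rank and size numbers are purely illustrative.
rng = np.random.default_rng(1)
r, n, rank, q = 5000, 300, 12, 20          # r ~ s*(k+p) rows of B; q = new k+p
B = rng.standard_normal((r, rank)) @ rng.standard_normal((rank, n))

Omega2 = rng.standard_normal((n, q))
Y2 = B @ Omega2             # r x q; each output row uses only one row of B,
                            # so this is computable streaming over B's rows
Q2, _ = np.linalg.qr(Y2)    # tall-skinny QR (TSQR-friendly), r x q
C = Q2.T @ B                # q x n -- small again; also accumulable row by row
Uc, s, Vt = np.linalg.svd(C, full_matrices=False)
U_B = Q2 @ Uc               # approximate left singular vectors of B

err = np.linalg.norm(B - (U_B * s) @ Vt) / np.linalg.norm(B)
```

Nothing here ever requires all of B in memory at once except the dense in-core QR, which is exactly the step a streaming/TSQR formulation would replace.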
