I see. Very interesting. The only problem (well, something I perceive as a problem) is not even that B got inflated so much, but rather that the reduced SVD problem is not a (k+p) x (k+p) problem anymore. There are two things here:
-- the user doesn't really set the actual precision anymore (k+p was supposed to be the lever);
-- the reduced SVD problem dimensions are now ~m.

Initially I thought the philosophy behind this was that we want to solve a streaming problem of m x n size and reduce it to a problem that doesn't depend on m or n, and that memory-wise n is only constrained by our memory settings on the mapper. In reality, under these circumstances, n can easily be 1E6 (8 MB dense vector) or more (default Hadoop mapper setting -Xmx200m). m is not bound at all by memory constraints (i.e. streaming goes along m). So in the example I thought of trying, m x n can be 1E9 x 1E6, e.g. a petabyte-scale problem (sort of an SVD version of a Terasort benchmark).

But I guess if the BBt dimensions are now ~s(k+p), where s~m, then that is not true anymore and m is theoretically bounded as well (whether it is a practical issue or not is not my point; most likely it is not). This kind of shifts the weight of computation from the MR side to what I think is a single-threaded eigensolver.

I would like to spend just a tad more time poking around to see if there's still a way to make MR work a little bit harder.

-d

On Tue, Oct 12, 2010 at 11:10 PM, Ted Dunning <[email protected]> wrote:

> I don't think it would be a big problem to have this dependency, but I would
> prefer to simply port the eigenvalue/svd decomposition from math to use our
> vectors directly. We need such a port and they have tests for it already. I am
> pretty sure that CM's svd is higher quality than Colt's in any case.
>
> If there is a way to use our vectors and commons math's code, that would be
> lovely. I kind of doubt it, however.
>
> On Tue, Oct 12, 2010 at 7:37 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > -- i also ended up using eigen from apache commons math 2.1. But math
> > module has dependency on it but core module (which is also math heavy)
> > doesn't have such a dependency. Is it a big deal if we add one to the core
> > module too?
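
P.S. For a rough sense of the sizes discussed above, here is a back-of-envelope sketch in Python. It only does arithmetic on dense double-precision storage. The values k+p = 100 and s = 1000 are made up for illustration and do not come from any actual run, and the helper names are hypothetical.

```python
# Back-of-envelope memory estimates for dense double-precision data.
# All concrete values of k+p and s below are illustrative assumptions.

BYTES_PER_DOUBLE = 8


def dense_vector_bytes(length):
    """Bytes needed for one dense vector of doubles."""
    return length * BYTES_PER_DOUBLE


def dense_square_bytes(dim):
    """Bytes needed for a dense dim x dim matrix of doubles (e.g. BBt)."""
    return dim * dim * BYTES_PER_DOUBLE


# One dense vector of 1E6 doubles: 8 MB, well within -Xmx200m.
row_bytes = dense_vector_bytes(10**6)

# Reduced problem when BBt is (k+p) x (k+p), with an assumed k+p = 100.
k_plus_p = 100
small_bbt_bytes = dense_square_bytes(k_plus_p)

# Reduced problem when BBt is ~s(k+p) on a side, with an assumed s = 1000.
s = 1000
large_bbt_bytes = dense_square_bytes(s * k_plus_p)

print(row_bytes)        # 8000000 (8 MB)
print(small_bbt_bytes)  # 80000 (80 KB)
print(large_bbt_bytes)  # 80000000000 (80 GB)
```

With these assumed numbers, the (k+p) x (k+p) case is trivial for a single-threaded eigensolver, while the ~s(k+p) case grows quadratically in s and quickly stops fitting in one process, which is the concern raised above.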
