Hi Ted, > I thought we would need. One question I have is you you dealt with the need > to keep a large amount of data in memory during the internal SVD step. Can > you write up your approach on a wiki page or attach a document to the JIRA?
the approach is a pretty straightforward translation of what is described in the paper, you can see the working version at http://my-trac.assembla.com/gensim/browser/trunk/src/gensim/models/lsimodel.py (stochasticSvd() at the bottom, last ~80 lines). Assuming the input A is a (#features x #documents) matrix (streamed in columns=documents), my implementation still needs O(#features) memory (in particular, 2 * sizeof(double) * #features * #samples bytes). The only difference to Halko et al is, I couldn't do the one-pass approach suggested in that paper, because it requires O(#docs) memory (prohibitive and I couldn't find any way to avoid it). In my application, I am only interested in the predictive projection (S^-1 * U^-1), so I don't even compute the right singular vectors V (their size is O(#docs), too, anyway), so I effectively compute eigendecomposition (U, S^2) only. I figured if V are really needed, they can always be computed in another pass over the input, V = (S^-1 * U^-1 ) * A. Since this randomized algo is multi-pass anyway, there is no fundamental difference between 2 passes or 3 passes (as opposed to 1-pass vs 2-pass)... Looking forward to your implementation so we can compare! Radim > > As far as I know, there hasn't been much progress since MAHOUT-309 was > posted. Can you put up a patch on that JIRA so we can compare notes. > > On Fri, Sep 3, 2010 at 11:22 AM, RadimRehurek <[email protected]>wrote: > > > > > Hello everyone, > > > > I just implemented an eigensolver based on Halko's article "Finding > > structure with randomness", for streamed input. I couldn't find any way to > > do it in a single pass (without requiring O(number of observations) of > > memory), so my version works in two passes over the input (no power > > iterations). > > > > I was looking around to see if there are other streamed (out-of-core) > > implementations, to compare and perhaps get inspiration ;) and I came across > > MAHOUT-309: https://issues.apache.org/jira/browse/MAHOUT-309 > > > > That issue seems very quiet though, how far along did you guys get? > > > > This stochastic algorithm seems pretty fast: 2.5h on the English Wikipedia > > (3.2M documents, 200K features, 0.5G non-zeros) for 400 factors, compared to > > 14h for the incremental, non-stochastic one-pass algo, on my MacBook. And I > > think it's more accurate, too, but I'll have to run some more tests. > > > > I'm curious to hear what your experience was, cheers, > > Radim > > > > >
