Hi Ted,

> I thought we would need.  One question I have is you you dealt with the need
> to keep a large amount of data in memory during the internal SVD step.  Can
> you write up your approach on a wiki page or attach a document to the JIRA?

the approach is a pretty straightforward translation of what is described in 
the paper, you can see the working version at
http://my-trac.assembla.com/gensim/browser/trunk/src/gensim/models/lsimodel.py

(stochasticSvd() at the bottom, last ~80 lines). Assuming the input A is a 
(#features x #documents) matrix (streamed in columns=documents), my 
implementation still needs O(#features) memory (in particular, 2 * 
sizeof(double) * #features * #samples bytes).

The only difference to Halko et al is, I couldn't do the one-pass approach 
suggested in that paper, because it requires O(#docs) memory (prohibitive and I 
couldn't find any way to avoid it). In my application, I am only interested in 
the predictive projection (S^-1 * U^-1), so I don't even compute the right 
singular vectors V (their size is O(#docs), too, anyway), so I effectively 
compute eigendecomposition (U, S^2) only. I figured if V are really needed, 
they can always be computed in another pass over the input, V = (S^-1 * U^-1 ) 
* A. Since this randomized algo is multi-pass anyway, there is no fundamental 
difference between 2 passes or 3 passes (as opposed to 1-pass vs 2-pass)...

Looking forward to your implementation so we can compare!
Radim

> 
> As far as I know, there hasn't been much progress since MAHOUT-309 was
> posted.  Can you put up a patch on that JIRA so we can compare notes.
> 
> On Fri, Sep 3, 2010 at 11:22 AM, RadimRehurek <[email protected]>wrote:
> 
> >
> > Hello everyone,
> >
> > I just implemented an eigensolver based on Halko's article "Finding
> > structure with randomness", for streamed input. I couldn't find any way to
> > do it in a single pass (without requiring O(number of observations) of
> > memory), so my version works in two passes over the input (no power
> > iterations).
> >
> > I was looking around to see if there are other streamed (out-of-core)
> > implementations, to compare and perhaps get inspiration ;) and I came across
> > MAHOUT-309: https://issues.apache.org/jira/browse/MAHOUT-309
> >
> > That issue seems very quiet though, how far along did you guys get?
> >
> > This stochastic algorithm seems pretty fast: 2.5h on the English Wikipedia
> > (3.2M documents, 200K features, 0.5G non-zeros) for 400 factors, compared to
> > 14h for the incremental, non-stochastic one-pass algo, on my MacBook. And I
> > think it's more accurate, too, but I'll have to run some more tests.
> >
> > I'm curious to hear what your experience was, cheers,
> > Radim
> >
> 
> 
> 

Reply via email to