Great stuff. The stochastic method should be very fast. Two passes is what I thought we would need. One question I have is how you dealt with the need to keep a large amount of data in memory during the internal SVD step. Can you write up your approach on a wiki page or attach a document to the JIRA?
As far as I know, there hasn't been much progress since MAHOUT-309 was posted. Can you put up a patch on that JIRA so we can compare notes?

On Fri, Sep 3, 2010 at 11:22 AM, RadimRehurek <[email protected]> wrote:

> Hello everyone,
>
> I just implemented an eigensolver based on Halko's article "Finding
> structure with randomness", for streamed input. I couldn't find any way to
> do it in a single pass (without requiring O(number of observations) of
> memory), so my version works in two passes over the input (no power
> iterations).
>
> I was looking around to see if there are other streamed (out-of-core)
> implementations, to compare and perhaps get inspiration ;) and I came across
> MAHOUT-309: https://issues.apache.org/jira/browse/MAHOUT-309
>
> That issue seems very quiet though, how far along did you guys get?
>
> This stochastic algorithm seems pretty fast: 2.5h on the English Wikipedia
> (3.2M documents, 200K features, 0.5G non-zeros) for 400 factors, compared to
> 14h for the incremental, non-stochastic one-pass algo, on my MacBook. And I
> think it's more accurate, too, but I'll have to run some more tests.
>
> I'm curious to hear what your experience was, cheers,
> Radim
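
For comparing notes, here is a minimal sketch of how a two-pass stochastic decomposition of this kind might be organized so that memory stays independent of the number of documents. It is an illustration under stated assumptions, not the code described in the message above and not anything from MAHOUT-309: documents are assumed to arrive as columns of dense NumPy chunks, and the names `two_pass_stochastic_svd`, `chunks` and `oversample` are hypothetical. The first pass accumulates the random projection Y = A·Omega; the second pass accumulates the small matrix B·Bᵀ with B = Qᵀ·A, so that only (features x l) and (l x l) arrays are ever held in memory.

```python
import numpy as np


def two_pass_stochastic_svd(chunks, num_features, k, oversample=10, seed=0):
    """Sketch of a two-pass randomized SVD (no power iterations).

    chunks: hypothetical zero-argument callable that returns a fresh iterator
            over dense (num_features x chunk_size) arrays, documents as columns.
    Returns the top-k left singular vectors U and singular values s.
    Assumes num_features >= k + oversample.
    """
    rng = np.random.default_rng(seed)
    l = k + oversample  # projection width, slightly larger than k

    # Pass 1: accumulate Y = A @ Omega one chunk at a time. The rows of Omega
    # that belong to a chunk are generated on the fly and then discarded.
    Y = np.zeros((num_features, l))
    for chunk in chunks():
        omega = rng.standard_normal((chunk.shape[1], l))
        Y += chunk @ omega

    # Orthonormal basis for the sampled range of A (num_features x l).
    Q, _ = np.linalg.qr(Y)

    # Pass 2: accumulate B @ B.T with B = Q.T @ A. B itself would need
    # O(number of documents) memory, so only the l x l product is kept.
    BBt = np.zeros((l, l))
    for chunk in chunks():
        b = Q.T @ chunk
        BBt += b @ b.T

    # Eigenvalues of B @ B.T are the squared singular values of B.
    evals, evecs = np.linalg.eigh(BBt)      # ascending order
    order = np.argsort(evals)[::-1][:k]     # indices of the k largest
    s = np.sqrt(np.clip(evals[order], 0.0, None))
    U = Q @ evecs[:, order]
    return U, s
```

The trade-off in this sketch is that the right singular vectors are never formed, since storing them would again cost memory proportional to the number of documents, and that working with B·Bᵀ squares the condition number, so very small singular values lose precision.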
