Great stuff.  The stochastic method should be very fast.  Two passes is what
I thought we would need.  One question I have is how you dealt with the need
to keep a large amount of data in memory during the internal SVD step.  Can
you write up your approach on a wiki page or attach a document to the JIRA?

As far as I know, there hasn't been much progress since MAHOUT-309 was
posted.  Can you put up a patch on that JIRA so we can compare notes?

On Fri, Sep 3, 2010 at 11:22 AM, RadimRehurek <[email protected]> wrote:

>
> Hello everyone,
>
> I just implemented an eigensolver based on Halko's article "Finding
> structure with randomness", for streamed input. I couldn't find any way to
> do it in a single pass (without requiring O(number of observations) of
> memory), so my version works in two passes over the input (no power
> iterations).
>
> I was looking around to see if there are other streamed (out-of-core)
> implementations, to compare and perhaps get inspiration ;) and I came across
> MAHOUT-309: https://issues.apache.org/jira/browse/MAHOUT-309
>
> That issue seems very quiet though, how far along did you guys get?
>
> This stochastic algorithm seems pretty fast: 2.5h on the English Wikipedia
> (3.2M documents, 200K features, 0.5G non-zeros) for 400 factors, compared to
> 14h for the incremental, non-stochastic one-pass algo, on my MacBook. And I
> think it's more accurate, too, but I'll have to run some more tests.
>
> I'm curious to hear what your experience was, cheers,
> Radim
>
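
For comparing notes, here is a minimal sketch of one way a two-pass
randomized scheme can avoid O(number of observations) memory. It is not
necessarily what either implementation in this thread does. It assumes the
input is streamed as dense column blocks (features x documents) and that the
number of features is at least rank + oversample. The function name, its
parameters, and the chunks callable are all hypothetical. The first pass
accumulates Y = A * Omega block by block; the second pass accumulates only
the small Gram matrix of B = Q^T A, so nothing of size proportional to the
number of documents is ever held in memory.

```python
import numpy as np

def two_pass_randomized_svd(chunks, num_features, rank, oversample=10, seed=0):
    """Sketch only. `chunks` is a callable returning a fresh iterator over
    dense (num_features x c) column blocks of the input matrix A."""
    rng = np.random.default_rng(seed)
    width = rank + oversample

    # Pass 1: Y = A @ Omega, built up one column block at a time.
    # The matching rows of the random matrix Omega are drawn on the fly.
    Y = np.zeros((num_features, width))
    for block in chunks():
        omega = rng.standard_normal((block.shape[1], width))
        Y += block @ omega
    Q, _ = np.linalg.qr(Y)  # orthonormal basis for the sampled range

    # Pass 2: accumulate G = B @ B.T with B = Q.T @ A, without storing B.
    G = np.zeros((Q.shape[1], Q.shape[1]))
    for block in chunks():
        b = Q.T @ block
        G += b @ b.T

    # Eigenpairs of G give the squared singular values of B and its
    # left singular vectors; eigh returns eigenvalues in ascending order.
    evals, evecs = np.linalg.eigh(G)
    order = np.argsort(evals)[::-1][:rank]
    s = np.sqrt(np.clip(evals[order], 0.0, None))
    U = Q @ evecs[:, order]
    return U, s
```

This variant returns only the left singular vectors and singular values.
Forming the Gram matrix squares the condition number, so small singular
values come out less accurately than with a direct SVD of B.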
