> ------------ Original message ------------
> From: Dmitriy Lyubimov <[email protected]>
> Subject: Re: stochastic SVD
> Date: 07.9.2010 01:42:14
> ----------------------------------------
> you may want to take a look at mahout-376.
>
> https://issues.apache.org/jira/browse/MAHOUT-376

Nice. If I understand these documents correctly, you also want to make the 
algorithm's memory use independent of the feature set size (not just the 
corpus size). That would be awesome! My laptop already starts complaining when 
I extract 1000 factors (~2000 with oversampling) with #features=100K. And I can 
imagine users could easily want more than 1000 factors/100K features (though 
nobody has so far :-)
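
For a rough sense of scale, here is a back-of-the-envelope sketch. It assumes the sample matrix is held in memory as a dense array of 64-bit floats with one row per feature and one column per sample; that layout and the helper name `dense_bytes` are assumptions for illustration only.

```python
def dense_bytes(num_features, num_samples, bytes_per_value=8):
    """Size in bytes of a dense num_features x num_samples matrix.

    bytes_per_value=8 assumes float64 storage (an assumption, see above).
    """
    return num_features * num_samples * bytes_per_value


# 100K features, ~2000 samples (1000 factors with 2x oversampling)
size = dense_bytes(100_000, 2_000)
size_gb = size / 1e9  # decimal gigabytes
```

Under these assumptions a single such matrix takes 1.6 GB, before counting any temporary copies made during orthonormalization.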

But I don't understand the step where you orthonormalize Q: it looks like you 
want to do it in blocks of Y, but orth(Y1 ; Y2 ; Y3 ...) != orth(Y1) ; orth(Y2) 
; orth(Y3) ... (the first two equations of sd.pdf). What is the trick behind 
that?
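
To make the concern concrete, here is a small NumPy sketch (the toy sizes, the two-block split and the helper `residual` are assumptions chosen for illustration, not taken from sd.pdf). It orthonormalizes Y as a whole and block by block, then checks whether each result still spans the column space of Y.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((6, 2))   # toy sample matrix
Y1, Y2 = Y[:3], Y[3:]             # two row blocks of Y

# orth(Y): orthonormal basis for the column space of the whole matrix
Q_full, _ = np.linalg.qr(Y)

# orth(Y1) ; orth(Y2): orthonormalize each block separately, then stack.
# The stacked columns have norm sqrt(2), so rescale to make them orthonormal.
Q1, _ = np.linalg.qr(Y1)
Q2, _ = np.linalg.qr(Y2)
Q_blocks = np.vstack([Q1, Q2]) / np.sqrt(2)


def residual(Q, Y):
    """Norm of the part of Y that lies outside the span of Q's columns."""
    return np.linalg.norm(Y - Q @ (Q.T @ Y))


res_full = residual(Q_full, Y)      # close to zero: Q_full spans range(Y)
res_blocks = residual(Q_blocks, Y)  # clearly nonzero: the span has changed
```

Both matrices have orthonormal columns, but only `Q_full` reproduces Y when Y is projected onto it. The blockwise version discards the per-block R factors, which differ from block to block, so some extra step would be needed to recombine them.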


Radim



>
>
> On Sun, Sep 5, 2010 at 10:37 PM, RadimRehurek <[email protected]> wrote:
>
> > Hello Dmitriy,
> >
> > > i came very close to try out Ted's layout for stochastic svd (the way i
> > > understood it, with block QR solvers on mapper side instead of
> > > gram-schmidt on the whole scale as Tropp seems to suggest) and not
> > > following mahout's general architecture, but I wasn't actually able to
> > > carve out enough time for this. But sooner or later somebody will
> > > implement that, and then stuff
> >
> > you are right -- I kicked myself into finally implementing that paper after
> > seeing Daisuke Okanohara's "redsvd" :-)
> >
> > * http://code.google.com/p/redsvd/wiki/English , in-core stochastic
> > decomposition on top of eigen3, a beautiful C++ template library
> >
> > In any case, it looks like the power iterations are really needed already
> > for mid-sized problems like Wikipedia. Either that, or use very aggressive
> > over-sampling (I use #samples=2*requested_rank at the moment, and it's not
> > enough). That means doing extra passes over the data to compute
> > Y=(A*A^T)^q*A*G instead of just Y=A*G :-( ... unless there is a clever way
> > to avoid it.
> >
> > Radim
> >
> > > On Sun, Sep 5, 2010 at 7:33 PM, RadimRehurek <[email protected]> wrote:
> > >
> > > > See that module's docstring; reading the input is slower than
> > > > processing it with the stochastic decomposition.
> > > >
> > > > In short: in order for distributed computing to make sense
> > > > (performance-wise), the data would already need to be pre-distributed,
> > > > too.
> > > >
> > > > This is true in Hadoop, so I guess stochastic decomposition is an algo
> > > > where MAHOUT could really make a difference on terabyte+ problems.
> > > >
> > > > Radim
> > > >
> > >
> > >
> > >
> >
>
>
>
