2011/11/3 Kenneth C. Arnold <[email protected]>: > On Wed, Nov 2, 2011 at 6:04 PM, Olivier Grisel <[email protected]> > wrote: >> 2011/11/2 Radim Rehurek <[email protected]>: >>> If you decide to implement the randomized PCA, I can offer some >>> observations: >>> >>> 1. oversampling does little, accuracy comes mostly from the extra power >>> iteration steps >>> 2. no power iterations result in miserable accuracy >>> 3. extra power iteration steps quickly lead to numerical overflows; but QR >>> is pretty fast, so in gensim, I orthonormalize the intermediate matrices H >>> after each power iteration step. That's exactly the same method that remark >>> 3.3 refers to. >> >> Interesting. The current implementation in scikit-learn (which is >> neither streamed nor parallel) does quite a bit of oversampling (if k >> components are rrequired, 2 * k random vectors are used) and uses 3 >> power iterations by default but does not do qr inside the power >> iteration steps: >> >> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/pca.py#L346 >> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/extmath.py#L126 > > I played with this a bit a while back. Trivial patch attached. > > How much the power iteration helps apparently depends on how quickly > the singular spectrum decays. The matrices I was playing with had > slowly decaying spectra; I found q=10 got acceptable results (judging > by whether the singular values matched the non-randomized results). I > didn't find that the orthonormalization step helped much for my case, > and the QR did slow things down significantly, so I didn't pursue this > patch further. Perhaps orthonormalizing every few iterations would be > a good balance. > > In contrast to the comments in fast_svd, I did not find (k+p) > > rank(M) to be necessary. The Halko paper suggests the same thing iirc. > > Also, I found the randomized range finder (algorithm 4.3) to be more > generally useful, as Halko et al suggests, so I pulled it into a > separate function. Patch also attached. Let me know if you want either > of these as a pull request, though both are trivial.
Thank you very much. Pull requests would indeed be better to engage others in the review process (the gihub integrated comment + diff is nicer to discuss code than emails IMHO). -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel ------------------------------------------------------------------------------ RSA(R) Conference 2012 Save $700 by Nov 18 Register now http://p.sf.net/sfu/rsa-sfdev2dev1 _______________________________________________ Scikit-learn-general mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
