Re: [Scikit-learn-general] Randomized PCA

Olivier Grisel Wed, 02 Nov 2011 19:23:14 -0700

2011/11/3 Kenneth C. Arnold <[email protected]>:
> On Wed, Nov 2, 2011 at 6:04 PM, Olivier Grisel <[email protected]> 
> wrote:
>> 2011/11/2 Radim Rehurek <[email protected]>:
>>> If you decide to implement the randomized PCA, I can offer some 
>>> observations:
>>>
>>> 1. oversampling does little, accuracy comes mostly from the extra power 
>>> iteration steps
>>> 2. no power iterations result in miserable accuracy
>>> 3. extra power iteration steps quickly lead to numerical overflows; but QR 
>>> is pretty fast, so in gensim, I orthonormalize the intermediate matrices H 
>>> after each power iteration step. That's exactly the same method that remark 
>>> 3.3 refers to.
>>
>> Interesting. The current implementation in scikit-learn (which is
>> neither streamed nor parallel) does quite a bit of oversampling (if k
>> components are rrequired,  2 * k random vectors are used) and uses 3
>> power iterations by default but does not do qr inside the power
>> iteration steps:
>>
>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/pca.py#L346
>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/extmath.py#L126
>
> I played with this a bit a while back. Trivial patch attached.
>
> How much the power iteration helps apparently depends on how quickly
> the singular spectrum decays. The matrices I was playing with had
> slowly decaying spectra; I found q=10 got acceptable results (judging
> by whether the singular values matched the non-randomized results). I
> didn't find that the orthonormalization step helped much for my case,
> and the QR did slow things down significantly, so I didn't pursue this
> patch further. Perhaps orthonormalizing every few iterations would be
> a good balance.
>
> In contrast to the comments in fast_svd, I did not find (k+p) >
> rank(M) to be necessary. The Halko paper suggests the same thing iirc.
>
> Also, I found the randomized range finder (algorithm 4.3) to be more
> generally useful, as Halko et al suggests, so I pulled it into a
> separate function. Patch also attached. Let me know if you want either
> of these as a pull request, though both are trivial.


Thank you very much. Pull requests would indeed be better to engage
others in the review process (the gihub integrated comment + diff is
nicer to discuss code than emails IMHO).

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

------------------------------------------------------------------------------
RSA(R) Conference 2012
Save $700 by Nov 18
Register now
http://p.sf.net/sfu/rsa-sfdev2dev1
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Re: [Scikit-learn-general] Randomized PCA

Reply via email to