On Fri, Mar 6, 2015 at 11:09 AM, Luca Puggini <lucapug...@gmail.com> wrote:
> Hi,
> It seems to me that you are discussing topics that can be introduced in
> sklearn with GSoC.
>
> I use sklearn quite a lot and there are a couple of things that I really
> miss in this library:
>
> 1- Nipals PCA.
> The current version of PCA is too slow for high dimensional datasets.
> Suppose you have p=10000 variables and are interested in only the first 10
> principal components. In a situation like this nipals PCA is much more
> efficient. Other algorithms such as PLS could also improve their
> computational performance with nipals PCA.
>
>
PCA does an SVD, whose complexity depends on the shorter side of the
matrix. If you have n=100, p=10000, the complexity is O(n^2 * p). However,
if both dimensions are high, it is true that a decomposition that only
calculates the required number of components becomes necessary.
RandomizedPCA
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/pca.py#L468
does this using random projections; nipals would be an alternative.
PLS already uses nipals:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cross_decomposition/pls_.py#L22
In the context of a refactoring of PLS/CCA, there could also be an
evaluation of whether the existing nipals code can be reused for PCA.
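
To make the idea concrete, here is a minimal NumPy sketch of nipals applied
to PCA. It is not the code in pls_.py and not a proposed scikit-learn API:
the function name nipals_pca, its parameters, and the toy data are all made
up for illustration. It assumes the leading singular values are well
separated and that n_components does not exceed the rank of the centered
data.

```python
import numpy as np


def nipals_pca(X, n_components, max_iter=500, tol=1e-10):
    """Hypothetical helper: leading principal components via NIPALS."""
    X = np.asarray(X, dtype=float)
    residual = X - X.mean(axis=0)  # PCA works on column-centered data
    n_samples, n_features = residual.shape
    scores = np.zeros((n_samples, n_components))
    loadings = np.zeros((n_components, n_features))

    for k in range(n_components):
        # Start from the residual column with the largest variance.
        t = residual[:, np.argmax(residual.var(axis=0))].copy()
        for _ in range(max_iter):
            p = residual.T @ t
            p /= np.linalg.norm(p)  # unit-norm loading vector
            t_new = residual @ p    # updated score vector
            converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        scores[:, k] = t
        loadings[k] = p
        # Deflate: remove the extracted component before finding the next one.
        residual = residual - np.outer(t, p)
    return scores, loadings


# Toy data (an assumption for this example): n=100, p=2000, three strong
# latent directions plus a little noise.
rng = np.random.RandomState(0)
latent = rng.randn(100, 3) * np.array([10.0, 5.0, 2.0])
X = latent @ rng.randn(3, 2000) + 0.1 * rng.randn(100, 2000)

scores, loadings = nipals_pca(X, n_components=3)
```

Each component costs a few matrix-vector products with the data, so the work
grows with the number of components requested rather than with the size of a
full decomposition. Convergence slows down when consecutive singular values
are close to each other.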

> 2- Something to rank the variables
> At the moment it seems to me that the only way to rank the variables is
> the Random Forest importance. This method is known to be strongly biased.
> I suggest something like the method implemented in the R library party.
>
>
Could you elaborate?
>
> I hope that these comments can help.
> I may decide to apply for GSoC as well :-)
>
> Best,
> Luca
>
>
>
> ------------------------------------------------------------------------------
> Dive into the World of Parallel Programming The Go Parallel Website,
> sponsored
> by Intel and developed in partnership with Slashdot Media, is your hub for
> all
> things parallel software development, from weekly thought leadership blogs
> to
> news, videos, case studies, tutorials and more. Take a look and join the
> conversation now. http://goparallel.sourceforge.net/
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>