[ 
https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7856:
-------------------------------------
    Issue Type: Improvement  (was: Bug)

> Scalable PCA implementation for tall and fat matrices
> -----------------------------------------------------
>
>                 Key: SPARK-7856
>                 URL: https://issues.apache.org/jira/browse/SPARK-7856
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Tarek Elgamal
>
> Currently the PCA implementation has a limitation of fitting d^2 
> covariance/grammian matrix entries in memory (d is the number of 
> columns/dimensions of the matrix). We often need only the largest k principal 
> components. To make pca really scalable, I suggest an implementation where 
> the memory usage is proportional to the principal components k rather than 
> the full dimensionality d. 
> I suggest adopting the solution described in this paper that is published in 
> SIGMOD 2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf). 
> The paper offers an implementation for Probabilistic PCA (PPCA) which has 
> less memory and time complexity and could potentially scale to tall and fat 
> matrices rather than tall and skinny matrices that is supported by the 
> current PCA impelmentation. 
> Probablistic PCA could be potentially added to the set of algorithms 
> supported by MLlib and it does not necessarily replace the old PCA 
> implementation.
> PPCA implementation is adopted in Matlab's Statistics and Machine Learning 
> Toolbox (http://www.mathworks.com/help/stats/ppca.html)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to