[
https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15961768#comment-15961768
]
Hayri Volkan Agun commented on SPARK-7856:
------------------------------------------
Hi Tarek,
Still on the issue of Probabilistic PCA: it would be very useful if there were an
implementation whose cost is parameterized by the number of principal components.
> Scalable PCA implementation for tall and fat matrices
> -----------------------------------------------------
>
> Key: SPARK-7856
> URL: https://issues.apache.org/jira/browse/SPARK-7856
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Tarek Elgamal
>
> Currently, the PCA implementation must fit the d^2 entries of the
> covariance/Gramian matrix in memory (where d is the number of
> columns/dimensions of the matrix). We often need only the largest k principal
> components. To make PCA truly scalable, I suggest an implementation whose
> memory usage is proportional to the number of principal components k rather
> than to the full dimensionality d.
> I suggest adopting the solution described in this paper that is published in
> SIGMOD 2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf).
> The paper offers an implementation of Probabilistic PCA (PPCA) that has
> lower memory and time complexity and could potentially scale to tall and fat
> matrices, rather than only the tall and skinny matrices supported by the
> current PCA implementation.
> Probabilistic PCA could potentially be added to the set of algorithms
> supported by MLlib; it does not necessarily replace the old PCA
> implementation.
> A PPCA implementation is included in MATLAB's Statistics and Machine Learning
> Toolbox (http://www.mathworks.com/help/stats/ppca.html).
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)