[ https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133521#comment-15133521 ]
Thang Nguyen commented on FLINK-1733:
-------------------------------------

I'm not sure whether {{DenseMatrix}} fits for sPCA. What if the matrix is relatively small? From the paper, where d is the number of principal components:

{quote}
matrix C, which is of size D × d (recall that d is typically small). For example, in our experiments with a 94 GB dataset, the size of matrix C was 30 MB, which can easily fit in memory.
{quote}

This matrix C is broadcast to the workers and is used to redundantly recompute an intermediate matrix (in favor of cutting down communication complexity). The distributed algorithm also only requires access to a single row at a time to compute a partial result; the partials are then summed at the end.

Is the lack of a distributed matrix/vector implementation enough of a blocker to be worried about, or should I continue?

> Add PCA to machine learning library
> -----------------------------------
>
>                 Key: FLINK-1733
>                 URL: https://issues.apache.org/jira/browse/FLINK-1733
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Thang Nguyen
>            Priority: Minor
>              Labels: ML
>
> Dimension reduction is a crucial prerequisite for many data analysis tasks.
> Therefore, Flink's machine learning library should contain a principal
> components analysis (PCA) implementation. Maria-Florina Balcan et al. [1]
> propose a distributed PCA. A more recent publication [2] describes another
> scalable PCA implementation.
>
> Resources:
> [1] [http://arxiv.org/pdf/1408.5823v5.pdf]
> [2] [http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
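Editorial note: the row-at-a-time pattern described in the comment (broadcast the small matrix C, have each worker compute a partial result from a single row, then sum the partials) can be sketched in plain Python. This is an illustrative sketch only, not the sPCA paper's code or Flink's API; the function names and the use of X^T·X·C as the intermediate product are assumptions chosen to show why only C, not the tall matrix X, needs to live in every worker's memory.

```python
# Hypothetical sketch (not Flink API): each "worker" sees one row x of the
# tall D-dimensional matrix X plus the small broadcast matrix C, and emits
# the partial contribution x^T (x C). Summing the per-row partials yields
# X^T X C without ever materializing X on a single machine.

def row_partial(x, C):
    """Contribution of a single row x (length d) to X^T X C (a d x k matrix)."""
    k = len(C[0])
    # x C: a length-k vector, computable from this one row alone.
    xc = [sum(x[j] * C[j][col] for j in range(len(x))) for col in range(k)]
    # Outer product x^T (x C): the per-row partial matrix.
    return [[x[i] * xc[col] for col in range(k)] for i in range(len(x))]

def sum_partials(parts):
    """Element-wise sum of the per-row partial matrices (the final reduce step)."""
    acc = [[0.0] * len(parts[0][0]) for _ in parts[0]]
    for p in parts:
        for i, row in enumerate(p):
            for col, v in enumerate(row):
                acc[i][col] += v
    return acc

# Tiny example: X is 3 x 2, C is 2 x 1 (the small broadcast matrix).
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
C = [[1.0], [1.0]]
result = sum_partials([row_partial(x, C) for x in X])
```

In Flink terms this would map onto a `map` over the row-wise DataSet with C as a broadcast variable, followed by a `reduce` that sums the partials, which is why the comment argues a dense local representation for C may suffice even without a distributed matrix type.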