[
https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133465#comment-15133465
]
Chiwan Park commented on FLINK-1733:
------------------------------------
Hi [~thang],
You can use {{breeze.linalg.DenseMatrix}}, but you have to convert it to
Flink's {{DenseMatrix}} at the end of the computation. I recommend implementing
an implicit conversion method between Breeze's {{DenseMatrix}} and Flink's
{{DenseMatrix}}. You can find the corresponding conversion for Flink's
{{DenseVector}} in {{DenseVector.scala}}.
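A minimal sketch of such an implicit conversion is shown here. To keep it runnable without extra dependencies, it uses two hypothetical stand-in case classes ({{BreezeLikeMatrix}} and {{FlinkLikeMatrix}}) instead of the real Breeze and Flink classes. It assumes both store their values in one flat, column-major array; check this against the actual classes before adapting it.

```scala
import scala.language.implicitConversions

// Hypothetical stand-ins for breeze.linalg.DenseMatrix[Double] and Flink's
// DenseMatrix. Assumption: both keep values in a flat, column-major array.
final case class BreezeLikeMatrix(rows: Int, cols: Int, data: Array[Double])
final case class FlinkLikeMatrix(numRows: Int, numCols: Int, data: Array[Double])

object MatrixConversions {
  // Copy the backing array so the two matrices never share mutable state.
  implicit def breezeToFlink(m: BreezeLikeMatrix): FlinkLikeMatrix =
    FlinkLikeMatrix(m.rows, m.cols, m.data.clone())

  implicit def flinkToBreeze(m: FlinkLikeMatrix): BreezeLikeMatrix =
    BreezeLikeMatrix(m.numRows, m.numCols, m.data.clone())
}

object ConversionSketch {
  import MatrixConversions._

  // Do the numeric work on the Breeze-like type and convert only at the end.
  def scaled(m: BreezeLikeMatrix, factor: Double): FlinkLikeMatrix = {
    val result = m.copy(data = m.data.map(_ * factor))
    result // breezeToFlink is applied implicitly because of the return type
  }
}
```

With the conversions in scope, a method can compute on the Breeze-like type and simply declare the Flink-like type as its result; the compiler inserts the conversion at the boundary.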
But I'm not sure that {{DenseMatrix}} is a good fit for sPCA. To achieve
scalability, we need distributed matrix and vector implementations. Currently,
FlinkML has no distributed matrix or vector implementation
(https://issues.apache.org/jira/browse/FLINK-1873).
How a DataSet is distributed depends on the source of the data. If the data
come from a distributed file system, they are already well distributed by the
file system, and Flink makes use of that distribution. In the typical case, you
don't need to care about the distribution of the data.
> Add PCA to machine learning library
> -----------------------------------
>
> Key: FLINK-1733
> URL: https://issues.apache.org/jira/browse/FLINK-1733
> Project: Flink
> Issue Type: New Feature
> Components: Machine Learning Library
> Reporter: Till Rohrmann
> Assignee: Thang Nguyen
> Priority: Minor
> Labels: ML
>
> Dimension reduction is a crucial prerequisite for many data analysis tasks.
> Therefore, Flink's machine learning library should contain a principal
> components analysis (PCA) implementation. Maria-Florina Balcan et al. [1]
> propose a distributed PCA. A more recent publication [2] describes another
> scalable PCA implementation.
> Resources:
> [1] [http://arxiv.org/pdf/1408.5823v5.pdf]
> [2] [http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf]
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)