[
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15628347#comment-15628347
]
Yanbo Liang edited comment on SPARK-15784 at 11/2/16 9:32 AM:
--------------------------------------------------------------
I prefer #1 and #3, and it looks like we can achieve both goals.
A graph can be represented as a GraphX/GraphFrame graph or as a DataFrame/RDD.
A PIC model can be trained on either, but the internal implementation uses
GraphX operators, so input given as an RDD of tuples must first be converted
to the GraphX representation. It is therefore straightforward to make PIC one
of the algorithms in GraphX (or GraphFrames, when it is merged back into
Spark). However, users may also load their graph as a DataFrame/RDD and
transform it in an ML Pipeline, which should be supported too, so it would be
better to wrap the GraphX/GraphFrames PIC as a Pipeline stage that ML users can
use as well.
For historical reasons (we don't want to add new features to GraphX), I
propose splitting this task into the following steps:
* Add PIC to the Pipeline API as a Transformer that uses the GraphX operators
in its implementation (this is consistent with [~josephkb]'s proposal).
* Add the PIC algorithm to GraphFrames once it is merged into Spark.
* Make the ML PIC a wrapper that calls the GraphFrames PIC implementation.
I think this approach should work better for both groups of users (ML users
and GraphFrames users), but I'm still open to hearing your thoughts. Thanks.
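
The layering proposed above can be illustrated with a minimal, self-contained
Python sketch. It is not Spark code: Graph, graph_pic_embedding and PICStage
are hypothetical names standing in for the graph representation, the
graph-level PIC routine and the Pipeline stage. The final clustering step is
also an assumption made for brevity: it splits the one-dimensional embedding
into two clusters at its largest gap, where a full PIC implementation would
run k-means.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[int, int, float]  # (src, dst, similarity), similarity > 0


class Graph:
    """Hypothetical stand-in for a GraphX/GraphFrames graph (undirected, weighted)."""

    def __init__(self) -> None:
        self.adj: Dict[int, Dict[int, float]] = defaultdict(dict)

    @classmethod
    def from_edges(cls, edges: Iterable[Edge]) -> "Graph":
        # Conversion step: tuple rows become the graph representation.
        g = cls()
        for src, dst, w in edges:
            g.adj[src][dst] = w
            g.adj[dst][src] = w
        return g


def graph_pic_embedding(graph: Graph, iterations: int) -> Dict[int, float]:
    """Graph-level layer: truncated power iteration on the
    row-normalized affinity matrix, started from the degree vector."""
    degree = {v: sum(nbrs.values()) for v, nbrs in graph.adj.items()}
    total = sum(degree.values())
    vec = {v: d / total for v, d in degree.items()}
    for _ in range(iterations):
        nxt = {
            v: sum(w * vec[u] for u, w in nbrs.items()) / degree[v]
            for v, nbrs in graph.adj.items()
        }
        norm = sum(abs(x) for x in nxt.values())  # L1 normalization
        vec = {v: x / norm for v, x in nxt.items()}
    return vec


def split_at_largest_gap(embedding: Dict[int, float]) -> Dict[int, int]:
    """Simplified two-way clustering (assumption: stands in for k-means)."""
    ordered = sorted(embedding, key=embedding.get)
    gaps = [
        embedding[ordered[i + 1]] - embedding[ordered[i]]
        for i in range(len(ordered) - 1)
    ]
    cut = gaps.index(max(gaps)) + 1
    return {v: (0 if i < cut else 1) for i, v in enumerate(ordered)}


class PICStage:
    """Hypothetical pipeline stage: a thin wrapper that converts rows to a
    graph and delegates to the graph-level routine."""

    def __init__(self, iterations: int = 3) -> None:
        self.iterations = iterations

    def transform(self, rows: List[Edge]) -> List[Tuple[int, int]]:
        graph = Graph.from_edges(rows)
        embedding = graph_pic_embedding(graph, self.iterations)
        labels = split_at_largest_gap(embedding)
        return sorted(labels.items())  # rows of (vertex id, cluster)


# Two tightly linked pairs joined by one weak edge.
edges = [(0, 1, 1.0), (2, 3, 3.0), (1, 2, 0.1)]
assignments = PICStage(iterations=3).transform(edges)
```

The point of the sketch is that the stage holds no clustering logic of its
own, so replacing the graph-level routine (GraphX now, GraphFrames later)
would not change what ML users call.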
> Add Power Iteration Clustering to spark.ml
> ------------------------------------------
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.