[jira] [Comment Edited] (SPARK-15784) Add Power Iteration Clustering to spark.ml

Yanbo Liang (JIRA) Wed, 02 Nov 2016 02:33:20 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15628347#comment-15628347
 ]


Yanbo Liang edited comment on SPARK-15784 at 11/2/16 9:32 AM:
--------------------------------------------------------------

I prefer #1 and #3, and it looks like we can achieve both goals. 
A graph can be represented as GraphX/GraphFrame or as DataFrame/RDD. A PIC 
model can be trained on either, but the internal implementation uses GraphX 
operators, so input given as an RDD of tuples has to be converted to the 
GraphX representation first. That makes it straightforward to offer PIC as 
one of the algorithms in GraphX (or GraphFrames, once it is merged back into 
Spark). However, users may also load their graph as a DataFrame/RDD and 
transform it in an ML Pipeline, which should be supported as well, so it 
would be better to wrap the GraphX/GraphFrames PIC as a Pipeline stage that 
ML users can use too. For historical reasons (we don't want to add new 
features to GraphX), I propose splitting this task into the following steps:
* Put PIC in the Pipeline API as a Transformer that uses the GraphX operators 
in its implementation (this is consistent with [~josephkb]'s proposal).
* Add the PIC algorithm to GraphFrames when it is merged into Spark.
* Make the ML PIC a wrapper that calls the GraphFrames PIC implementation.
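
To make the layering in the steps above concrete, here is a minimal sketch in 
plain Python rather than Spark. Every name in it (to_graph, graph_pic, 
PICTransformer, the src/dst/weight columns) is hypothetical and is not part of 
any Spark or GraphFrames API. It also assumes a degree-based start vector and 
replaces the usual k-means step with a simple cut at the largest gaps of the 
1-D embedding, so it only illustrates the wrapper idea, not a real 
implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[int, int, float]          # (src, dst, similarity)
Graph = Dict[int, Dict[int, float]]    # symmetric adjacency map


def to_graph(edges: Iterable[Edge]) -> Graph:
    """Stand-in for converting tuples into a graph representation."""
    adj: Graph = defaultdict(dict)
    for src, dst, weight in edges:
        adj[src][dst] = weight
        adj[dst][src] = weight
    return dict(adj)


def graph_pic(graph: Graph, k: int, max_iter: int = 10) -> Dict[int, int]:
    """Graph-side PIC (hypothetical): truncated power iteration, then a 1-D split."""
    degree = {v: sum(nbrs.values()) for v, nbrs in graph.items()}
    total = sum(degree.values())
    vec = {v: degree[v] / total for v in graph}   # assumed degree-based start
    for _ in range(max_iter):
        # One step of v <- D^-1 A v, followed by L1 normalization.
        nxt = {v: sum(w * vec[u] for u, w in graph[v].items()) / degree[v]
               for v in graph}
        norm = sum(abs(x) for x in nxt.values())
        vec = {v: x / norm for v, x in nxt.items()}
    # Simplification: cut the sorted embedding at its k-1 largest gaps
    # instead of running k-means on it.
    ordered = sorted(vec, key=vec.get)
    by_gap = sorted(range(1, len(ordered)),
                    key=lambda i: vec[ordered[i]] - vec[ordered[i - 1]],
                    reverse=True)
    cuts = set(by_gap[:k - 1])
    labels: Dict[int, int] = {}
    label = 0
    for i, v in enumerate(ordered):
        if i in cuts:
            label += 1
        labels[v] = label
    return labels


class PICTransformer:
    """Pipeline-side wrapper (hypothetical): takes tabular rows, delegates to graph_pic."""

    def __init__(self, k: int, max_iter: int = 10, src_col: str = "src",
                 dst_col: str = "dst", weight_col: str = "weight") -> None:
        self.k = k
        self.max_iter = max_iter
        self.src_col = src_col
        self.dst_col = dst_col
        self.weight_col = weight_col

    def transform(self, rows: List[dict]) -> List[dict]:
        # Convert the tabular input to the graph form, then call the graph-side code.
        edges = [(r[self.src_col], r[self.dst_col], r[self.weight_col]) for r in rows]
        labels = graph_pic(to_graph(edges), self.k, self.max_iter)
        return [{"id": v, "cluster": c} for v, c in sorted(labels.items())]


# Example input: a triangle (0, 1, 2) and a 4-clique (3, 4, 5, 6) joined by one weak edge.
pairs = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)]
rows = [{"src": s, "dst": d, "weight": 1.0} for s, d in pairs]
rows.append({"src": 2, "dst": 3, "weight": 0.1})
result = PICTransformer(k=2).transform(rows)
```

The point of the split is that graph users could call the graph-side function 
directly, while ML users only see the row-in, row-out stage, and the wrapper 
can later be pointed at a different graph-side implementation without changing 
its interface.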

I think this approach should work better for both groups of users (ML users 
and GraphFrames users), but I'm still open to hearing your thoughts. Thanks.





> Add Power Iteration Clustering to spark.ml
> ------------------------------------------
>
>                 Key: SPARK-15784
>                 URL: https://issues.apache.org/jira/browse/SPARK-15784
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.


