[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332185#comment-15332185
 ] 

Miao Wang commented on SPARK-15784:
-----------------------------------

[~josephkb][~mengxr][~yanboliang] I am trying to add PIC to spark.ml and I have 
some questions regarding model.predict and saveImpl. The basic PIC algorithm 
has the following steps:

Input: A row-normalized affinity matrix W and the number of clusters k
Output: Clusters C_1, C_2, …, C_k

1. Pick an initial vector v_0
2. Repeat
     Set v_{t+1} ← W v_t
     Set δ_{t+1} ← |v_{t+1} – v_t|
     Increment t
   Stop when |δ_t – δ_{t-1}| ≈ 0
3. Use k-means to cluster points on v_t and return clusters C_1, C_2, …, C_k
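
For reference, the steps above can be sketched in plain Python. This is only a
minimal illustration under stated assumptions, not the MLlib implementation:
the function names (row_normalize, power_iteration, kmeans_1d, pic) are
hypothetical, the L1 renormalization of v after each multiplication is an
addition not listed in the steps, the ramp-shaped v_0 is an arbitrary
non-uniform choice (a uniform v_0 is already a fixed point of a row-normalized
W), and the tolerance and iteration cap are arbitrary.

```python
def row_normalize(affinity):
    """Divide each row of a square affinity matrix by its row sum."""
    return [[a / sum(row) for a in row] for row in affinity]


def power_iteration(w, v0, max_iter=200, tol=1e-5):
    """Iterate v <- W v until the change delta stops changing (early stop)."""
    total = sum(abs(x) for x in v0)
    v = [x / total for x in v0]
    prev_delta = float("inf")
    for _ in range(max_iter):
        wv = [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]
        norm = sum(abs(x) for x in wv)  # assumption: L1 renormalization
        nxt = [x / norm for x in wv]
        delta = max(abs(a - b) for a, b in zip(nxt, v))
        v = nxt
        # Stop when |delta_t - delta_{t-1}| is approximately zero.
        if abs(delta - prev_delta) < tol:
            break
        prev_delta = delta
    return v


def kmeans_1d(values, k, iters=50):
    """Plain Lloyd iterations on scalars with deterministic seeding."""
    ordered = sorted(values)
    # Spread the initial centers across the sorted values.
    centers = [ordered[(len(ordered) - 1) * j // max(k - 1, 1)] for j in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in values]
        for c in range(k):
            members = [x for x, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels


def pic(affinity, k):
    """Cluster the rows of an affinity matrix; returns one label per row."""
    w = row_normalize(affinity)
    n = len(w)
    v0 = [float(i + 1) for i in range(n)]  # assumption: arbitrary non-uniform start
    v = power_iteration(w, v0)
    return kmeans_1d(v, k)


# Two tight groups {0, 1, 2} and {3, 4, 5} joined by weak links.
affinity = [
    [0.0 if i == j else (1.0 if i // 3 == j // 3 else 0.01) for j in range(6)]
    for i in range(6)
]
labels = pic(affinity, k=2)
```

Note that pic only returns labels for the rows of the matrix it was given;
nothing in it can be applied to a point that was not part of that matrix.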

In the last step, k-means clusters the points using the pseudo-eigenvector `v` 
generated by PIC. Therefore, model.predict should use the trained k-means model 
to do the prediction. However, to obtain `v` for the data to be predicted, PIC 
has to be run again on that data. So there is no trained model that can be 
applied to a new data set; model.predict is actually training again, just like 
PIC.fit. In this case, PIC.fit and PIC.predict would call the same run method 
of the MLlib implementation. 

Since we have to train on the data anyway, saving the model is not useful, as 
there is no model to be saved. In the MLlib implementation, the save function 
saves the cluster assignments of the current data set, which can't be used to 
cluster new data. The only use of the saved result is that when the same data 
is given again, we don't have to train again. However, we have no way of 
knowing whether the given data is the training data behind the saved model.

Please correct me if I misunderstand anything. Thanks!

Miao




> Add Power Iteration Clustering to spark.ml
> ------------------------------------------
>
>                 Key: SPARK-15784
>                 URL: https://issues.apache.org/jira/browse/SPARK-15784
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.


