[ 
https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326951#comment-14326951
 ] 

Liang-Chi Hsieh commented on SPARK-5832:
----------------------------------------

For clustering algorithm, O(N^2) complexity is not rare. So I suggest maybe we 
can consider clustering algorithm more with its usability, scalability, feature 
in addition to complexity?

One of the feature of AP is that it does not require users to specify the 
number of clusters. As I know, we have no clustering algorithms in MLlib with 
similar feature?

O(N^2) time/memory complexity is the worst case for AP. The worst case means 
that we are going to cluster on a complete digraph (directed graph, every pair 
of distinct vertices is connected by a pair of unique edges) or a complete 
graph (undirected graph, every pair of distinct vertices is connected by a 
unique edge). As we know, that is very rare for practical use case. At most 
cases, the graph we process would be sparse graph.

Unlike K-means and DBSCAN, AP is not a metric-space clustering techniques. For 
K-means and DBSCAN, they do not distinguish between dense and sparse graph. 
Because they need to calculate the distances between the possible cluster 
centers and all other nodes. But for AP, since the similarity graph is given, a 
sparse graph can significantly reduce the time/memory complexity required to 
perform clustering.

I have no strong opinion towards whether we should include AP in MLlib. As 
Xiangrui Meng suggested, it is also a good idea to just maintain the 
implementation outside Spark as a third-party package.



> Add Affinity Propagation clustering algorithm
> ---------------------------------------------
>
>                 Key: SPARK-5832
>                 URL: https://issues.apache.org/jira/browse/SPARK-5832
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Liang-Chi Hsieh
>            Assignee: Liang-Chi Hsieh
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to