[ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284470#comment-14284470
 ] 

Stephen Boesch commented on SPARK-4259:
---------------------------------------

Xiangrui has provided valuable feedback. His latest recommendation points out 
that the Gaussian SImilarities will result in a small proportion of the input 
vertices having non-zero (or nearly zero) value . That ratio may then represent 
the out-degree of each vertex of the graph.  The graph edges will represent the 
sparse (non-zero) matrix entries of the Normalized Affinity matrix W - so W_ij 
that have the non-zero entries.  The algorithm thus bears similarities to 
PageRank.  

We are using the Power Iteration Clustering algorithm. In each iteration of the 
PIC the components of the estimated Eigenvector - represented by vertices in 
the Graph - are updated via Graph.aggregateMessages execution. 

Further input from Xiangrui: 

The graph is sparse, we don’t need to store edges with 0 similarity. We can 
assume that the average degree is D and then the number of edges is D N, where 
N is the number of vertices. It should be much less than N^2.



> Add Spectral Clustering Algorithm with Gaussian Similarity Function
> -------------------------------------------------------------------
>
>                 Key: SPARK-4259
>                 URL: https://issues.apache.org/jira/browse/SPARK-4259
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Fan Jiang
>            Assignee: Fan Jiang
>              Labels: features
>
> In recent years, spectral clustering has become one of the most popular 
> modern clustering algorithms. It is simple to implement, can be solved 
> efficiently by standard linear algebra software, and very often outperforms 
> traditional clustering algorithms such as the k-means algorithm.
> We implemented the unnormalized graph Laplacian matrix by Gaussian similarity 
> function. A brief design looks like below:
> Unnormalized spectral clustering
> Input: raw data points, number k of clusters to construct: 
> • Comupte Similarity matrix S ∈ Rn×n, .
> • Construct a similarity graph. Let W be its weighted adjacency matrix.
> • Compute the unnormalized Laplacian L = D - W. where D is the Degree 
> diagonal matrix
> • Compute the first k eigenvectors u1, . . . , uk of L.
> • Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns.
> • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th 
> row of U.
> • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into 
> clusters C1, . . . , Ck.
> Output: Clusters A1, . . . , Ak with Ai = { j | yj ∈ Ci }.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to