[ https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284470#comment-14284470 ]
Stephen Boesch commented on SPARK-4259: --------------------------------------- Xiangrui has provided valuable feedback. His latest recommendation points out that the Gaussian SImilarities will result in a small proportion of the input vertices having non-zero (or nearly zero) value . That ratio may then represent the out-degree of each vertex of the graph. The graph edges will represent the sparse (non-zero) matrix entries of the Normalized Affinity matrix W - so W_ij that have the non-zero entries. The algorithm thus bears similarities to PageRank. We are using the Power Iteration Clustering algorithm. In each iteration of the PIC the components of the estimated Eigenvector - represented by vertices in the Graph - are updated via Graph.aggregateMessages execution. Further input from Xiangrui: The graph is sparse, we don’t need to store edges with 0 similarity. We can assume that the average degree is D and then the number of edges is D N, where N is the number of vertices. It should be much less than N^2. > Add Spectral Clustering Algorithm with Gaussian Similarity Function > ------------------------------------------------------------------- > > Key: SPARK-4259 > URL: https://issues.apache.org/jira/browse/SPARK-4259 > Project: Spark > Issue Type: New Feature > Components: MLlib > Reporter: Fan Jiang > Assignee: Fan Jiang > Labels: features > > In recent years, spectral clustering has become one of the most popular > modern clustering algorithms. It is simple to implement, can be solved > efficiently by standard linear algebra software, and very often outperforms > traditional clustering algorithms such as the k-means algorithm. > We implemented the unnormalized graph Laplacian matrix by Gaussian similarity > function. A brief design looks like below: > Unnormalized spectral clustering > Input: raw data points, number k of clusters to construct: > • Comupte Similarity matrix S ∈ Rn×n, . > • Construct a similarity graph. Let W be its weighted adjacency matrix. > • Compute the unnormalized Laplacian L = D - W. where D is the Degree > diagonal matrix > • Compute the first k eigenvectors u1, . . . , uk of L. > • Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. > • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th > row of U. > • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into > clusters C1, . . . , Ck. > Output: Clusters A1, . . . , Ak with Ai = { j | yj ∈ Ci }. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org