Shannon, The term spectral clustering comes from the spectral analysis of a graph, not from any optical connections.
On Sun, Aug 15, 2010 at 5:41 PM, <[email protected]> wrote: > Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT) > Page: Spectral Clustering ( > https://cwiki.apache.org/confluence/display/MAHOUT/Spectral+Clustering) > > > Edited by Shannon Quinn: > --------------------------------------------------------------------- > Spectral clustering is a more powerful and specialized algorithm (compared > to K-means) which has significant use in photo editing, hence its name. Each > object to be clustered can initially be represented as an _n_\-dimensional > numeric vector, but the difference with this algorithm is that there must > also be some method for performing a comparison between each object and > expressing this comparison as a scalar. > > This _n_ by _n_ comparison of all objects with all others forms the > _affinity_ matrix, which can be intuitively thought of as a rough > representation of an underlying undirected, weighted, and fully-connected > graph whose edges express the relative relationships, or affinities, between > each pair of objects in the original data. This affinity matrix forms the > basis from which the two spectral clustering algorithms operate. > > The equation by which the affinities are calculated can vary depending on > the user's circumstances; typically, the equation takes the form of: > > exp( _d{_}{^}2^ / _c_ ) > > where _d_ is the Euclidean distance between a pair of points, and _c_ is a > scaling factor. _c_ is often calculated relative to a _k_\-neighborhood of > closest points to the current point; all other affinities are set to 0 > outside of the neighborhood. Again, this formula can vary depending on the > situation (e.g. a fully-connected graph would ignore the _k_\-neighborhood > and calculate affinities for all pairs of points). > > [Full overview on spectral clustering| > http://spectrallyclustered.wordpress.com/2010/05/27/intro-and-spectral-clustering-101/ > ] > > h2. K-Means Spectral Clustering > > h3. Overview > > This consists of a few basic steps of generalized spectral clustering, > followed by standard k-means clustering over the intermediate results. > Again, this process begins with an affinity matrix *A* - whether or not it > is fully-connected depends on the user's need. > > *A* is then transformed into a pseudo-Laplacian matrix via a multiplication > with a diagonal matrix whose entries consist of the sums of the rows of *A*. > The sums are modified to be the inverse square root of their original > values. The final operation looks something like: > > L = D^{-1/2} A D^{-1/2} > > *L* has some properties that are of interest to us; most importantly, while > it is symmetric like *A*, it has a more stable eigen-decomposition. *L* is > decomposed into its constituent eigenvectors and corresponding eigenvalues > (though the latter will not be needed for future calculations); the matrix > of eigenvectors, *U*, is what we are now interested in. > > Assuming *U* is a column matrix (the eigenvectors comprise the columns), > then we will now use the _rows_ of *U* as proxy data for the original data > points. We will run each row through standard K-means clustering, and the > label that each proxy point receives will be transparently assigned to the > corresponding original data point, resulting in the final clustering > assignments. > > [Full overview on k-means spectral clustering| > http://spectrallyclustered.wordpress.com/2010/06/05/sprint-1-k-means-spectral-clustering/ > ] > > h3. Implementation > > The Mahout implementation consists of a single driver - > SpectralKMeansDriver - calling upon several common utilities. The driver > performs the operations in sequential fashion: reading in and constructing > the affinity matrix, building the diagonal matrix, building the > pseudo-Laplacian and decomposing it, and clustering the components. > > The affinity matrix input is the most important part of this algorithm. It > consists of text files which follow a specific format: that of a weighted, > undirected graph. In order to represent a graph in text files, each line of > a text file represents a single directional edge between two nodes. There > are three comma-separated values on the line. The first number indicates the > source node, the second is the destination node, and the third is the > weight. For example: > > 0, 1, 2.5 > > would indicate the directional edge from node 0 to node 1 has a weight of > 2.5. *Please note: as of 8/16/2010, Eigencuts assumes the affinity matrix is > symmetric, hence there should be a corresponding line in the text file of: > 1, 0, 2.5.* Also, each node should be an integer value. > > M/R jobs written for SpectralKMeans: > * AffinityMatrixInputJob (reads the raw input into a DistributedRowMatrix) > * MatrixDiagonalizeJob (constructs the diagonal matrix) > * UnitVectorizerJob (converts the eigenvector matrix *U* to unit rows) > * VectorMatrixMultiplicationJob (multiplies *D* with *A*) > > M/R jobs already in Mahout that were used: > * DistributedRowMatrix.transpose() > * DistributedLanczosSolver > * EigenVerfierJob > * KMeansDriver > > h2. Eigencuts Spectral Clustering > > h3. Overview > > Intuitively, Eigencuts can be thought of as part 2 of the K-means > algorithm, in that it performs the same initial steps up until the k-means > clustering. The algorithm uses the same affinity matrix *A*, constructs the > same diagonal matrix *D*, performs the same multiplication to create the > pseudo-Laplacian *L*, and conducts an eigen-decomposition on *L* to obtain > the eigenvectors and eigenvalues. But here is where the two algorithms > differentiate. > > For each eigenvector, we wish to determine how stable its flow of > probability is within the underlying graph of the original data. > Intuitively, this is intrinsically related to the min-cut, max-flow problem > of finding bottlenecks: if we perturb the flow rate on a specific edge, and > overall the flow is stable, then we can conclude that this edge was not a > bottleneck. If, however, perturbing an edge significantly alters the overall > flow, we know this edge's eigenflow is very unstable and is a bottleneck. > > We have an [explicit form| > http://spectrallyclustered.files.wordpress.com/2010/07/sensitivityequation.png] > of this "sensitivity" calculation ([full post here, under "computing > sensitivities"| > http://spectrallyclustered.wordpress.com/2010/07/15/sprint-3-three-last-mr-tasks/]). > The next step is called "non-maximal suppression", which effectively means > we will ignore any of the calculated sensitivities for which there exists > another more negative sensitivity in the same neighborhood as the current > one, effectively "suppressing" it. > > This non-maximal suppression then plays a role in the final > affinity-cutting step, where we "cut" the affinity (set to 0) between any > two nodes (effectively destroying the edge between them) for which the > sensitivity calculated at that edge passes some threshold, _and_ for which > the sensitivity was _not_ suppressed in the previous step. > > Once the cutting has been completed, the process loops upon itself > (starting with the recalculation of *D* using the modified *A*) until no new > cuts in *A* are made in the final step. > > [Full overview on Eigencuts spectral clustering| > http://spectrallyclustered.wordpress.com/2010/07/06/sprint-3-introduction-to-eigencuts/ > ] > > h3. Implementation > > Since the first half of Eigencuts uses the same calculations as Spectral > K-means, it uses the same common M/R tasks, both those specific to spectral > clustering, as well as those general to Mahout. Unlike SpectralKMeans, > however, there are no DistributedRowMatrix-specific operations performed, > and hence there is no need for the data type at all; Mahout Vectors are used > heavily instead. > > Once the initial affinity matrix is constructed, there is a loop within the > EigencutsDriver over the calculation of *D*, the creation of *L* and its > eigen-decomposition, the calculation of the sensitivities, and the actual > affinity cuts, such that the loop terminates only when no cuts are made to > the affinity matrix *A*. The final matrix *A* will then be representative of > a graph structure whose data points exist in intra-connected clusters. > > M/R tasks specific to Eigencuts: > * EigencutsSensitivityJob (calculates the perturbation effects on edge > weights) > * EigencutsAffinityCutsJob (sets edge weights to 0) > > M/R tasks within spectral clustering: > * AffinityMatrixInputJob (reads the raw input into a DistributedRowMatrix) > * MatrixDiagonalizeJob (constructs the diagonal matrix) > * VectorMatrixMultiplicationJob (multiplies *D* with *A*) > > M/R tasks general to Mahout: > * DistributedLanczosSolver > * EigenVerifierJob > > h2. Quickstart > > h2. Examples > > Change your notification preferences: > https://cwiki.apache.org/confluence/users/viewnotifications.action >
