Re: [CONF] Apache Mahout > Spectral Clustering

Ted Dunning Sun, 15 Aug 2010 17:55:45 -0700

Shannon,

The term spectral clustering comes from the spectral analysis of a graph,
not from any optical connections.


On Sun, Aug 15, 2010 at 5:41 PM, <[email protected]> wrote:

> Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
> Page: Spectral Clustering (
> https://cwiki.apache.org/confluence/display/MAHOUT/Spectral+Clustering)
>
>
> Edited by Shannon Quinn:
> ---------------------------------------------------------------------
> Spectral clustering is a more powerful and specialized algorithm (compared
> to K-means) which has significant use in photo editing, hence its name. Each
> object to be clustered can initially be represented as an _n_\-dimensional
> numeric vector, but the difference with this algorithm is that there must
> also be some method for performing a comparison between each object and
> expressing this comparison as a scalar.
>
> This _n_ by _n_ comparison of all objects with all others forms the
> _affinity_ matrix, which can be intuitively thought of as a rough
> representation of an underlying undirected, weighted, and fully-connected
> graph whose edges express the relative relationships, or affinities, between
> each pair of objects in the original data. This affinity matrix forms the
> basis from which the two spectral clustering algorithms operate.
>
> The equation by which the affinities are calculated can vary depending on
> the user's circumstances; typically, the equation takes the form of:
>
> exp( _d{_}{^}2^ / _c_ )
>
> where _d_ is the Euclidean distance between a pair of points, and _c_ is a
> scaling factor. _c_ is often calculated relative to a _k_\-neighborhood of
> closest points to the current point; all other affinities are set to 0
> outside of the neighborhood. Again, this formula can vary depending on the
> situation (e.g. a fully-connected graph would ignore the _k_\-neighborhood
> and calculate affinities for all pairs of points).
>
> [Full overview on spectral clustering|
> http://spectrallyclustered.wordpress.com/2010/05/27/intro-and-spectral-clustering-101/
> ]
>
> h2. K-Means Spectral Clustering
>
> h3. Overview
>
> This consists of a few basic steps of generalized spectral clustering,
> followed by standard k-means clustering over the intermediate results.
> Again, this process begins with an affinity matrix *A* - whether or not it
> is fully-connected depends on the user's need.
>
> *A* is then transformed into a pseudo-Laplacian matrix via a multiplication
> with a diagonal matrix whose entries consist of the sums of the rows of *A*.
> The sums are modified to be the inverse square root of their original
> values. The final operation looks something like:
>
> L = D^{-1/2} A D^{-1/2}
>
> *L* has some properties that are of interest to us; most importantly, while
> it is symmetric like *A*, it has a more stable eigen-decomposition. *L* is
> decomposed into its constituent eigenvectors and corresponding eigenvalues
> (though the latter will not be needed for future calculations); the matrix
> of eigenvectors, *U*, is what we are now interested in.
>
> Assuming *U* is a column matrix (the eigenvectors comprise the columns),
> then we will now use the _rows_ of *U* as proxy data for the original data
> points. We will run each row through standard K-means clustering, and the
> label that each proxy point receives will be transparently assigned to the
> corresponding original data point, resulting in the final clustering
> assignments.
>
> [Full overview on k-means spectral clustering|
> http://spectrallyclustered.wordpress.com/2010/06/05/sprint-1-k-means-spectral-clustering/
> ]
>
> h3. Implementation
>
> The Mahout implementation consists of a single driver -
> SpectralKMeansDriver - calling upon several common utilities. The driver
> performs the operations in sequential fashion: reading in and constructing
> the affinity matrix, building the diagonal matrix, building the
> pseudo-Laplacian and decomposing it, and clustering the components.
>
> The affinity matrix input is the most important part of this algorithm. It
> consists of text files which follow a specific format: that of a weighted,
> undirected graph. In order to represent a graph in text files, each line of
> a text file represents a single directional edge between two nodes. There
> are three comma-separated values on the line. The first number indicates the
> source node, the second is the destination node, and the third is the
> weight. For example:
>
> 0, 1, 2.5
>
> would indicate the directional edge from node 0 to node 1 has a weight of
> 2.5. *Please note: as of 8/16/2010, Eigencuts assumes the affinity matrix is
> symmetric, hence there should be a corresponding line in the text file of:
> 1, 0, 2.5.* Also, each node should be an integer value.
>
> M/R jobs written for SpectralKMeans:
> * AffinityMatrixInputJob (reads the raw input into a DistributedRowMatrix)
> * MatrixDiagonalizeJob (constructs the diagonal matrix)
> * UnitVectorizerJob (converts the eigenvector matrix *U* to unit rows)
> * VectorMatrixMultiplicationJob (multiplies *D* with *A*)
>
> M/R jobs already in Mahout that were used:
> * DistributedRowMatrix.transpose()
> * DistributedLanczosSolver
> * EigenVerfierJob
> * KMeansDriver
>
> h2. Eigencuts Spectral Clustering
>
> h3. Overview
>
> Intuitively, Eigencuts can be thought of as part 2 of the K-means
> algorithm, in that it performs the same initial steps up until the k-means
> clustering. The algorithm uses the same affinity matrix *A*, constructs the
> same diagonal matrix *D*, performs the same multiplication to create the
> pseudo-Laplacian *L*, and conducts an eigen-decomposition on *L* to obtain
> the eigenvectors and eigenvalues. But here is where the two algorithms
> differentiate.
>
> For each eigenvector, we wish to determine how stable its flow of
> probability is within the underlying graph of the original data.
> Intuitively, this is intrinsically related to the min-cut, max-flow problem
> of finding bottlenecks: if we perturb the flow rate on a specific edge, and
> overall the flow is stable, then we can conclude that this edge was not a
> bottleneck. If, however, perturbing an edge significantly alters the overall
> flow, we know this edge's eigenflow is very unstable and is a bottleneck.
>
> We have an [explicit form|
> http://spectrallyclustered.files.wordpress.com/2010/07/sensitivityequation.png]
> of this "sensitivity" calculation ([full post here, under "computing
> sensitivities"|
> http://spectrallyclustered.wordpress.com/2010/07/15/sprint-3-three-last-mr-tasks/]).
> The next step is called "non-maximal suppression", which effectively means
> we will ignore any of the calculated sensitivities for which there exists
> another more negative sensitivity in the same neighborhood as the current
> one, effectively "suppressing" it.
>
> This non-maximal suppression then plays a role in the final
> affinity-cutting step, where we "cut" the affinity (set to 0) between any
> two nodes (effectively destroying the edge between them) for which the
> sensitivity calculated at that edge passes some threshold, _and_ for which
> the sensitivity was _not_ suppressed in the previous step.
>
> Once the cutting has been completed, the process loops upon itself
> (starting with the recalculation of *D* using the modified *A*) until no new
> cuts in *A* are made in the final step.
>
> [Full overview on Eigencuts spectral clustering|
> http://spectrallyclustered.wordpress.com/2010/07/06/sprint-3-introduction-to-eigencuts/
> ]
>
> h3. Implementation
>
> Since the first half of Eigencuts uses the same calculations as Spectral
> K-means, it uses the same common M/R tasks, both those specific to spectral
> clustering, as well as those general to Mahout. Unlike SpectralKMeans,
> however, there are no DistributedRowMatrix-specific operations performed,
> and hence there is no need for the data type at all; Mahout Vectors are used
> heavily instead.
>
> Once the initial affinity matrix is constructed, there is a loop within the
> EigencutsDriver over the calculation of *D*, the creation of *L* and its
> eigen-decomposition, the calculation of the sensitivities, and the actual
> affinity cuts, such that the loop terminates only when no cuts are made to
> the affinity matrix *A*. The final matrix *A* will then be representative of
> a graph structure whose data points exist in intra-connected clusters.
>
> M/R tasks specific to Eigencuts:
> * EigencutsSensitivityJob (calculates the perturbation effects on edge
> weights)
> * EigencutsAffinityCutsJob (sets edge weights to 0)
>
> M/R tasks within spectral clustering:
> * AffinityMatrixInputJob (reads the raw input into a DistributedRowMatrix)
> * MatrixDiagonalizeJob (constructs the diagonal matrix)
> * VectorMatrixMultiplicationJob (multiplies *D* with *A*)
>
> M/R tasks general to Mahout:
> * DistributedLanczosSolver
> * EigenVerifierJob
>
> h2. Quickstart
>
> h2. Examples
>
> Change your notification preferences:
> https://cwiki.apache.org/confluence/users/viewnotifications.action
>

Re: [CONF] Apache Mahout > Spectral Clustering

Reply via email to