[CONF] Apache Mahout > Spectral Clustering

confluence Sun, 15 Aug 2010 17:19:24 -0700

Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Spectral Clustering 
(https://cwiki.apache.org/confluence/display/MAHOUT/Spectral+Clustering)

Edited by Shannon Quinn:
---------------------------------------------------------------------
Spectral clustering is a more powerful and specialized algorithm (compared to
K-means) which has significant use in photo editing, hence its name. Each
object to be clustered can initially be represented as an _n_\-dimensional
numeric vector, but the difference with this algorithm is that there must also
be some method for performing a comparison between each object and expressing
this comparison as a scalar.

This _n_ by _n_ comparison of all objects with all others forms the _affinity_
matrix, which can be intuitively thought of as a rough representation of an
underlying undirected, weighted, and fully-connected graph whose edges express
the relative relationships, or affinities, between each pair of objects in the
original data. This affinity matrix forms the basis from which the two spectral
clustering algorithms operate.

The equation by which the affinities are calculated can vary depending on the
user's circumstances; typically, the equation takes the form of:

exp( _d{_}{^}2^ / _c_ )

where _d_ is the Euclidean distance between a pair of points, and _c_ is a
scaling factor. _c_ is often calculated relative to a _k_\-neighborhood of
closest points to the current point; all other affinities are set to 0 outside
of the neighborhood. Again, this formula can vary depending on the situation
(e.g. a fully-connected graph would ignore the _k_\-neighborhood and calculate
affinities for all pairs of points).

[Full overview on spectral
clustering|http://spectrallyclustered.wordpress.com/2010/05/27/intro-and-spectral-clustering-101/]

h2. K-Means Spectral Clustering

h3. Overview

This consists of a few basic steps of generalized spectral clustering, followed
by standard k-means clustering over the intermediate results. Again, this
process begins with an affinity matrix *A* - whether or not it is
fully-connected depends on the user's need.

*A* is then transformed into a pseudo-Laplacian matrix via a multiplication
with a diagonal matrix whose entries consist of the sums of the rows of *A*.
The sums are modified to be the inverse square root of their original values.
The final operation looks something like:

L = D^{-1/2} A D^{-1/2}

*L* has some properties that are of interest to us; most importantly, while it
is symmetric like *A*, it has a more stable eigen-decomposition. *L* is
decomposed into its constituent eigenvectors and corresponding eigenvalues
(though the latter will not be needed for future calculations); the matrix of
eigenvectors, *U*, is what we are now interested in.

Assuming *U* is a column matrix (the eigenvectors comprise the columns), then
we will now use the _rows_ of *U* as proxy data for the original data points.
We will run each row through standard K-means clustering, and the label that
each proxy point receives will be transparently assigned to the corresponding
original data point, resulting in the final clustering assignments.

[Full overview on k-means spectral
clustering|http://spectrallyclustered.wordpress.com/2010/06/05/sprint-1-k-means-spectral-clustering/]

h3. Implementation

The Mahout implementation consists of a single driver - SpectralKMeansDriver -
calling upon several common utilities. The driver performs the operations in
sequential fashion: reading in and constructing the affinity matrix, building
the diagonal matrix, building the pseudo-Laplacian and decomposing it, and
clustering the components.

The affinity matrix input is the most important part of this algorithm. It
consists of text files which follow a specific format: that of a weighted,
undirected graph. In order to represent a graph in text files, each line of a
text file represents a single directional edge between two nodes. There are
three comma-separated values on the line. The first number indicates the source
node, the second is the destination node, and the third is the weight. For
example:

0, 1, 2.5

would indicate the directional edge from node 0 to node 1 has a weight of 2.5.
*Please note: as of 8/16/2010, Eigencuts assumes the affinity matrix is
symmetric, hence there should be a corresponding line in the text file of: 1,
0, 2.5.* Also, each node should be an integer value.

M/R jobs written for SpectralKMeans:
* AffinityMatrixInputJob (reads the raw input into a DistributedRowMatrix)
* MatrixDiagonalizeJob (constructs the diagonal matrix)
* UnitVectorizerJob (converts the eigenvector matrix *U* to unit rows)
* VectorMatrixMultiplicationJob (multiplies *D* with *A*)

M/R jobs already in Mahout that were used:
* DistributedRowMatrix.transpose()
* DistributedLanczosSolver
* EigenVerfierJob
* KMeansDriver

h2. Eigencuts Spectral Clustering

h3. Overview

[Full overview on Eigencuts spectral
clustering|http://spectrallyclustered.wordpress.com/2010/07/06/sprint-3-introduction-to-eigencuts/]

h3. Implementation

h2. Quickstart

h2. Examples

Change your notification preferences:
https://cwiki.apache.org/confluence/users/viewnotifications.action

[CONF] Apache Mahout > Spectral Clustering

Reply via email to