Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Spectral Clustering
(https://cwiki.apache.org/confluence/display/MAHOUT/Spectral+Clustering)
Edited by Shannon Quinn:
---------------------------------------------------------------------
Spectral clustering is a more powerful and specialized algorithm (compared to
K-means) which has significant use in photo editing, hence its name. Each
object to be clustered can initially be represented as an _n_\-dimensional
numeric vector, but the difference with this algorithm is that there must also
be some method for performing a comparison between each object and expressing
this comparison as a scalar.
This _n_ by _n_ comparison of all objects with all others forms the _affinity_
matrix, which can be intuitively thought of as a rough representation of an
underlying undirected, weighted, and fully-connected graph whose edges express
the relative relationships, or affinities, between each pair of objects in the
original data. This affinity matrix forms the basis from which the two spectral
clustering algorithms operate.
The equation by which the affinities are calculated can vary depending on the
user's circumstances; typically, the equation takes the form of:
exp( _d{_}{^}2^ / _c_ )
where _d_ is the Euclidean distance between a pair of points, and _c_ is a
scaling factor. _c_ is often calculated relative to a _k_\-neighborhood of
closest points to the current point; all other affinities are set to 0 outside
of the neighborhood. Again, this formula can vary depending on the situation
(e.g. a fully-connected graph would ignore the _k_\-neighborhood and calculate
affinities for all pairs of points).
[Full overview on spectral
clustering|http://spectrallyclustered.wordpress.com/2010/05/27/intro-and-spectral-clustering-101/]
h2. K-Means Spectral Clustering
h3. Overview
This consists of a few basic steps of generalized spectral clustering, followed
by standard k-means clustering over the intermediate results. Again, this
process begins with an affinity matrix *A* - whether or not it is
fully-connected depends on the user's need.
*A* is then transformed into a pseudo-Laplacian matrix via a multiplication
with a diagonal matrix whose entries consist of the sums of the rows of *A*.
The sums are modified to be the inverse square root of their original values.
The final operation looks something like:
L = D^{-1/2} A D^{-1/2}
*L* has some properties that are of interest to us; most importantly, while it
is symmetric like *A*, it has a more stable eigen-decomposition. *L* is
decomposed into its constituent eigenvectors and corresponding eigenvalues
(though the latter will not be needed for future calculations); the matrix of
eigenvectors, *U*, is what we are now interested in.
Assuming *U* is a column matrix (the eigenvectors comprise the columns), then
we will now use the _rows_ of *U* as proxy data for the original data points.
We will run each row through standard K-means clustering, and the label that
each proxy point receives will be transparently assigned to the corresponding
original data point, resulting in the final clustering assignments.
[Full overview on k-means spectral
clustering|http://spectrallyclustered.wordpress.com/2010/06/05/sprint-1-k-means-spectral-clustering/]
h3. Implementation
h2. Eigencuts Spectral Clustering
h3. Overview
[Full overview on Eigencuts spectral
clustering|http://spectrallyclustered.wordpress.com/2010/07/06/sprint-3-introduction-to-eigencuts/]
h3. Implementation
h2. Quickstart
h2. Examples
Change your notification preferences:
https://cwiki.apache.org/confluence/users/viewnotifications.action