I think, based on spending two minutes reading Wikipedia, that the Rand
Index assumes that the assignment between groups in the two clusterings is
known.

In fact, I think that this is typically not known.

As an example, I created a dataset with 10,000 data points in 10
dimensions.  I created 10 highly separated clusters by putting Gaussian
distributions at randomly selected corners of the unit hyper-cube.

    c = matrix(0, nrow=10, ncol=10)
    for (i in 1:10) {c[i,] = (rnorm(10) > 0) + 0}

I repeated the generation of the corners until I got 10 different
centroids.  I checked this by looking at the rank of the matrix of centers:

    svd(c)$d

The last singular value here will be non-zero if all the rows of c are
linearly independent.
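
The regeneration loop isn't shown above; a minimal sketch of one way to do
it, keyed off the same svd check:

    # Sketch (not in the original): redraw the corners until the 10x10
    # matrix of centers has full rank
    repeat {
        for (i in 1:10) {c[i,] = (rnorm(10) > 0) + 0}
        if (min(svd(c)$d) > 1e-9) break
    }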

Generating the data is pretty easy:

    data = matrix(0, nrow=10000, ncol=10)
    for (i in 1:10) {
        for (j in 1:1000) {
            data[(i-1)*1000+j,] = matrix(rnorm(10, sd=0.1), ncol=10) + c[i,]
        }
    }

Normal k-means clustering will very commonly produce a poor clustering of
these data even though the clusters are very well separated.  If you do 10
restarts, you will likely get a good result.  I ran the clustering several
times and retained the cluster assignments:

    k1=kmeans(data, centers=10, nstart=10)
    k2=kmeans(data, centers=10, nstart=1)
    k3=kmeans(data, centers=10, nstart=1)

We can now produce plots of these clusterings by projecting the data onto 2
dimensions with the SVD and coloring points by cluster assignment (I have
put a sample plot at https://dl.dropboxusercontent.com/u/36863361/clusters.png).

    ss=svd(data)
    plot(data %*% ss$v[,1:2], col=rainbow(10)[k1$cluster])

We can abuse sparse matrices to get a contingency table of the cluster
assignments:

    require('Matrix')
    sparseMatrix(i=k1$cluster, j=k2$cluster, x=1)
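
For what it's worth, plain table() in base R produces the same
cross-tabulation as an ordinary dense table:

    table(k1$cluster, k2$cluster)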

The results look something like this:

    10 x 10 sparse Matrix of class "dgCMatrix"
     [1,]    .   . 1000   .    .    .   .    .    .   .
     [2,]    . 541    .   .    .    .   .    .    . 459
     [3,]    .   .    . 499    .    . 501    .    .   .
     [4,]    .   .    .   .    .    .   .    . 1000   .
     [5,]    .   . 1000   .    .    .   .    .    .   .
     [6,]    .   .    .   .    . 1000   .    .    .   .
     [7,]    .   .    .   . 1000    .   .    .    .   .
     [8,]    .   .    .   .    .    .   . 1000    .   .
     [9,] 1000   .    .   .    .    .   .    .    .   .
    [10,]    .   .    .   . 1000    .   .    .    .   .

Each of the 1000 values here represents a cluster that is recognized by
both runs.  The smaller numbers represent a cluster that got split.

So the problem is how to tell that a diagonal matrix, arbitrarily permuted,
is a perfect result and that this matrix isn't so good.

My first inclination would be to look at mutual information based measures
like the multinomial log-likelihood ratio (AKA LLR in Mahout land).

This leads to these kinds of numbers:

> mi
function(m) {entropy(m)-entropy(colSums(m))-entropy(rowSums(m))}
> entropy
function(m) {m = m/sum(m);sum(m * log (m+(m==0)))}
> mi(sparseMatrix(i=k6$cluster, j=k1$cluster, x=1))
[1] 2.025326
> mi(sparseMatrix(i=k2$cluster, j=k1$cluster, x=1))
[1] 2.163956
> mi(sparseMatrix(i=k1$cluster, j=k1$cluster, x=1))
[1] 2.302585
> mi(sparseMatrix(i=k1$cluster, j=k6$cluster, x=1))
[1] 2.025326
> mi(sparseMatrix(i=k6$cluster, j=k6$cluster, x=1))
[1] 2.163619
> mi(matrix(rmultinom(1, 10000, prob=rep(0.01, 100)), nrow=10))
[1] 0.005527828
>
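
Note that entropy() as written returns sum(p * log(p)), i.e. the negative
of the usual entropy, which is what gives mi() the right sign.  If you want
the LLR itself rather than the mutual information, it is just 2*N times the
mutual information measured in nats, so a one-line sketch on top of the
mi() above does it:

    # Multinomial log-likelihood ratio (G^2) from the same table;
    # sum(m) is the total count N
    llr = function(m) { 2 * sum(m) * mi(m) }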

The characteristics here are:

- Better clusterings produce higher mutual information scores.
- Equivalent clusterings that differ only in labeling score the same
  against each other as against themselves.
- Nearly identical clusterings with small defects score slightly lower.
- Unrelated assignments produce near-zero values.
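
For comparison with the ARI question in the quoted mail: the Adjusted Rand
Index can be computed from the same contingency tables.  This is a minimal
sketch, assuming the standard pair-counting form from the Wikipedia page
and taking an ordinary dense matrix:

    # Adjusted Rand Index from a dense contingency table (sketch)
    ari = function(m) {
        sum.ij = sum(choose(m, 2))           # agreeing pairs within cells
        sum.i  = sum(choose(rowSums(m), 2))  # pairs kept together by clustering 1
        sum.j  = sum(choose(colSums(m), 2))  # pairs kept together by clustering 2
        expected = sum.i * sum.j / choose(sum(m), 2)
        (sum.ij - expected) / ((sum.i + sum.j) / 2 - expected)
    }
    ari(as.matrix(sparseMatrix(i=k1$cluster, j=k2$cluster, x=1)))

A permuted diagonal table gives exactly 1, the split/merged table above
comes out noticeably lower, and unrelated assignments come out near 0.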



On Fri, Apr 5, 2013 at 4:47 AM, Dan Filimon <[email protected]> wrote:

> As part of evaluating cluster quality, I'd like to implement a bunch of
> quality measures, especially external ones.
>
> The one that I think would be particularly useful is the Adjusted Rand
> Index [1].
> Using a contingency table with the partitions from 2 clusterings, this
> returns a value from 0 to 1 (higher being better) corresponding to the
> similarity of the partitions.
>
> First of all, I'd like to know your thoughts on using ARI as a metric.
>
> Second, there's an implementation of ConfusionMatrix that is NxN. I'd like
> to extend this class to support unlabeled partitions of different sizes and
> add a method that computes the ARI.
>
> What are your thoughts?
>
> [1] http://en.wikipedia.org/wiki/Rand_index
>
