Yeah... Dan convinced me that the measure is invariant over cluster permutations, so it is fine.
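
A quick sanity check of that invariance in R, using the mi and entropy definitions
quoted further down in the thread (the labeling and the permutation here are made up
purely for illustration; this is not the Mahout code):

    # Mutual information between two clusterings, as defined later in the thread.
    entropy = function(m) {m = m / sum(m); sum(m * log(m + (m == 0)))}
    mi = function(m) {entropy(m) - entropy(colSums(m)) - entropy(rowSums(m))}

    # One clustering of 1000 points into 10 clusters, and the same clustering
    # with its labels arbitrarily permuted.
    a = sample(1:10, 1000, replace = TRUE)
    perm = sample(1:10)
    b = perm[a]

    mi(table(a, a))   # clustering scored against itself
    mi(table(a, b))   # same clustering, relabeled: same score (up to rounding)

Both calls return the entropy of the partition (about log(10) = 2.30 here), so
relabeling the clusters does not change the score.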
On Fri, Apr 12, 2013 at 10:01 AM, Dan Filimon <[email protected]> wrote:

> I think I follow what you said about the entropy-based measure.
>
> But why don't we know the assignment between groups?
> Here's how I implemented it [1].
>
> I first got the lists of centroids in both cases, then for each data
> point, computed the centroid closest to it in that group and incremented
> the corresponding element of the confusion matrix.
> So the assignment is given by the centroids and the distance measure.
>
> It looks valid to me, doesn't it?
>
> [1]
> https://github.com/dfilimon/mahout/blob/skm/core/src/main/java/org/apache/mahout/clustering/streaming/cluster/ClusteringUtils.java#L198
>
>
> On Thu, Apr 11, 2013 at 6:12 AM, Ted Dunning <[email protected]> wrote:
>
> > I think, based on spending two minutes reading Wikipedia, that the Rand
> > Index assumes that the assignment between groups in the two clusterings
> > is known.
> >
> > In fact, I think that this is typically not known.
> >
> > As an example, I created a dataset with 10,000 data points in 10
> > dimensions. I created 10 highly separated clusters by putting Gaussian
> > distributions at randomly selected corners of the unit hypercube:
> >
> >     c = matrix(0, nrow=10, ncol=10)
> >     for (i in 1:10) {c[i,] = (rnorm(10) > 0) + 0}
> >
> > I repeated the generation of the corners until I got 10 different
> > centroids. I checked this by looking at the rank of the matrix of
> > centers:
> >
> >     svd(c)$d
> >
> > The last value here will be non-zero if all the rows of c are linearly
> > independent.
> >
> > Generating the data is pretty easy:
> >
> >     data = matrix(0, nrow=10000, ncol=10)
> >     for (i in 1:10) {
> >       for (j in 1:1000) {
> >         data[(i-1)*1000+j,] = matrix(rnorm(10, sd=0.1), ncol=10) + c[i,]
> >       }
> >     }
> >
> > Normal k-means clustering will very commonly produce a poor clustering
> > of these data even though the clusters are very well separated. If you
> > do 10 restarts, you will likely get a good result. I did several such
> > runs (only a few shown here) and retained the cluster assignments:
> >
> >     k1 = kmeans(data, centers=10, nstart=10)
> >     k2 = kmeans(data, centers=10, nstart=1)
> >     k3 = kmeans(data, centers=10, nstart=1)
> >
> > We can now produce plots of these clusterings by projecting the data
> > onto 2 dimensions using SVD and coloring with the cluster assignment
> > (I have put a sample plot at
> > https://dl.dropboxusercontent.com/u/36863361/clusters.png):
> >
> >     ss = svd(data)
> >     plot(data %*% ss$v[,1:2], col=rainbow(10)[k1$cluster])
> >
> > We can abuse sparse matrices to get a contingency table of cluster
> > assignments:
> >
> >     require('Matrix')
> >     sparseMatrix(i=k1$cluster, j=k2$cluster, x=1)
> >
> > The results look something like this:
> >
> >     10 x 10 sparse Matrix of class "dgCMatrix"
> >     [1,]     .    . 1000    .    .    .    .    .    .    .
> >     [2,]     .  541    .    .    .    .    .    .    .  459
> >     [3,]     .    .    .  499    .    .  501    .    .    .
> >     [4,]     .    .    .    .    .    .    .    . 1000    .
> >     [5,]     .    . 1000    .    .    .    .    .    .    .
> >     [6,]     .    .    .    .    . 1000    .    .    .    .
> >     [7,]     .    .    .    . 1000    .    .    .    .    .
> >     [8,]     .    .    .    .    .    .    . 1000    .    .
> >     [9,]  1000    .    .    .    .    .    .    .    .    .
> >     [10,]    .    .    .    . 1000    .    .    .    .    .
> >
> > Each of the 1000 values here represents a cluster that is recognized by
> > both runs. The smaller numbers represent a split cluster.
> >
> > So the problem is how to tell that a diagonal matrix, arbitrarily
> > permuted, is perfect and that this matrix isn't so good.
> >
> > My first inclination would be to look at mutual-information-based
> > measures like the multinomial log-likelihood ratio (AKA LLR in Mahout
> > land).
> >
> > This leads to these kinds of numbers:
> >
> >     > mi
> >     function(m) {entropy(m) - entropy(colSums(m)) - entropy(rowSums(m))}
> >     > entropy
> >     function(m) {m = m/sum(m); sum(m * log(m + (m == 0)))}
> >     > mi(sparseMatrix(i=k6$cluster, j=k1$cluster, x=1))
> >     [1] 2.025326
> >     > mi(sparseMatrix(i=k2$cluster, j=k1$cluster, x=1))
> >     [1] 2.163956
> >     > mi(sparseMatrix(i=k1$cluster, j=k1$cluster, x=1))
> >     [1] 2.302585
> >     > mi(sparseMatrix(i=k1$cluster, j=k6$cluster, x=1))
> >     [1] 2.025326
> >     > mi(sparseMatrix(i=k6$cluster, j=k6$cluster, x=1))
> >     [1] 2.163619
> >     > mi(matrix(rmultinom(1, 10000, prob=rep(0.01, 100)), nrow=10))
> >     [1] 0.005527828
> >
> > The characteristics here are:
> >
> > Better clustering produces higher mutual information. Equivalent
> > clusterings that differ only in their labels score the same against
> > each other as against themselves. Nearly identical clusterings with
> > small defects score slightly lower. Unrelated assignments score near
> > zero.
> >
> >
> > On Fri, Apr 5, 2013 at 4:47 AM, Dan Filimon <[email protected]> wrote:
> >
> > > As part of evaluating cluster quality, I'd like to implement a bunch
> > > of quality measures, especially external ones.
> > >
> > > The one that I think would be particularly useful is the Adjusted
> > > Rand Index [1].
> > > Using a contingency table with the partitions from 2 clusterings,
> > > this returns a value from 0 to 1 (higher being better) corresponding
> > > to the similarity of the partitions.
> > >
> > > First of all, I'd like to know your thoughts on using ARI as a
> > > metric.
> > >
> > > Second, there's an implementation of ConfusionMatrix that is NxN.
> > > I'd like to extend this class to support unlabeled partitions of
> > > different sizes and add a method that computes the ARI.
> > >
> > > What are your thoughts?
> > >
> > > [1] http://en.wikipedia.org/wiki/Rand_index
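
For completeness on the original question: the Adjusted Rand Index can be computed
from the same kind of contingency table, and like the mutual-information score it
is invariant to how the clusters are labeled. Here is a minimal R sketch of the
textbook formula (this is not the linked Mahout ClusteringUtils implementation):

    # Adjusted Rand Index from a clustering contingency table
    # (rows = clusters from one run, columns = clusters from the other run).
    ari = function(m) {
      m = as.matrix(m)
      n = sum(m)
      sum.cells = sum(choose(m, 2))            # pairs together in both runs
      sum.rows  = sum(choose(rowSums(m), 2))   # pairs together in run 1
      sum.cols  = sum(choose(colSums(m), 2))   # pairs together in run 2
      expected  = sum.rows * sum.cols / choose(n, 2)
      maximum   = (sum.rows + sum.cols) / 2
      (sum.cells - expected) / (maximum - expected)
    }

    # For example, against the k-means runs above:
    #   ari(as.matrix(sparseMatrix(i=k1$cluster, j=k2$cluster, x=1)))

It is 1 for identical partitions under any relabeling, and close to 0 (it can dip
slightly below 0) for unrelated partitions.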
