Yeah... Dan convinced me that the measure is invariant over cluster
permutations, so it is fine.
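
For example, relabeling the clusters with an arbitrary permutation and recomputing
the score leaves it unchanged. A quick illustration (reusing the mi() helper, the
Matrix package and the k1 clustering from the quoted thread below):

    perm = sample(10)   # arbitrary relabeling of the 10 cluster ids
    mi(sparseMatrix(i=k1$cluster, j=k1$cluster, x=1))         # 2.302585
    mi(sparseMatrix(i=perm[k1$cluster], j=k1$cluster, x=1))   # identical score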


On Fri, Apr 12, 2013 at 10:01 AM, Dan Filimon <[email protected]> wrote:

> I think I follow what you said about the entropy-based measure.
>
> But why don't we know the assignment into groups?
> Here's how I implemented it [1].
>
> I first got the lists of centroids for both clusterings, then for each data
> point computed the centroid closest to it in each list and incremented the
> corresponding element of the confusion matrix.
> So the assignment is given by the centroids and the distance measure.
>
> It looks valid to me. Doesn't it?
>
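> A rough sketch of that procedure in R (hypothetical names: data is the matrix of
> points, c1 and c2 are the two centroid matrices, one centroid per row; the real
> implementation is the Java code at [1]):
>
>     closest = function(points, centroids) {
>       # squared Euclidean distance from every point to every centroid,
>       # then take the index of the nearest centroid for each point
>       d2 = outer(rowSums(points^2), rowSums(centroids^2), "+") - 2 * points %*% t(centroids)
>       max.col(-d2)
>     }
>     confusion = table(closest(data, c1), closest(data, c2))
>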
> [1]
>
> https://github.com/dfilimon/mahout/blob/skm/core/src/main/java/org/apache/mahout/clustering/streaming/cluster/ClusteringUtils.java#L198
>
>
> On Thu, Apr 11, 2013 at 6:12 AM, Ted Dunning <[email protected]> wrote:
>
> > Based on spending two minutes reading Wikipedia, I think the Rand Index
> > assumes that the assignment between groups in the two clusterings is known.
> >
> > In fact, I think that this is typically not known.
> >
> > As an example, I created a dataset with 10,000 data points in 10
> > dimensions.  I created 10 highly separated clusters by putting Gaussian
> > distributions at randomly selected corners of the unit hyper-cube.
> >
> >     c = matrix(0, nrow=10, ncol=10)          # one corner of the hyper-cube per row
> >     for (i in 1:10) {c[i,] = (rnorm(10)>0)+0}
> >
> > I repeated the generation of the corners until I got 10 different
> > centroids.  I checked this by looking at the rank of the matrix of centers:
> >
> >    svd(c)$d
> >
> > The last value here will be non-zero if all the rows of c are independent.
> >
> > Generating the data is pretty easy:
> >
> >     data = matrix(0, nrow=10000, ncol=10)
> >     for (i in 1:10) {
> >         for (j in 1:1000) {
> >             data[(i-1)*1000+j,] = matrix(rnorm(10,sd=0.1), ncol=10) + c[i,]
> >         }
> >     }
> >
> > Normal k-means clustering will very commonly produce a poor clustering of
> > these data even though they are very separated.  If you do 10 restarts, you
> > will likely get a good result.  I did several clusterings and retained the
> > cluster assignments:
> >
> >     k1=kmeans(data, centers=10, nstart=10)
> >     k2=kmeans(data, centers=10, nstart=1)
> >     k3=kmeans(data, centers=10, nstart=1)
> >
> > We can now produce plots of these clusterings by projecting the data using
> > SVD onto 2 dimensions and coloring with the cluster assignment (I have put a
> > sample plot at https://dl.dropboxusercontent.com/u/36863361/clusters.png).
> >
> >    ss=svd(data)
> >    plot(data %*% ss$v[,1:2], col=rainbow(10)[k1$cluster])
> >
> > We can abuse sparse matrices to get a contingency table of cluster
> > assignments:
> >
> >     require('Matrix')
> >     sparseMatrix(i=k1$cluster, j=k2$cluster, x=1)
> >
> > The results look something like this:
> >
> >     10 x 10 sparse Matrix of class "dgCMatrix"
> >      [1,]    .   . 1000   .    .    .   .    .    .   .
> >      [2,]    . 541    .   .    .    .   .    .    . 459
> >      [3,]    .   .    . 499    .    . 501    .    .   .
> >      [4,]    .   .    .   .    .    .   .    . 1000   .
> >      [5,]    .   . 1000   .    .    .   .    .    .   .
> >      [6,]    .   .    .   .    . 1000   .    .    .   .
> >      [7,]    .   .    .   . 1000    .   .    .    .   .
> >      [8,]    .   .    .   .    .    .   . 1000    .   .
> >      [9,] 1000   .    .   .    .    .   .    .    .   .
> >     [10,]    .   .    .   . 1000    .   .    .    .   .
> >
> > Each of the 1000 values here represents a cluster that is recognized by both
> > runs.  The smaller numbers represent a split cluster.
> >
> > So the problem is how to tell that a diagonal matrix, arbitrarily permuted,
> > is perfect and that this matrix isn't so good.
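> >
> > As an aside, one literal way to test the "permuted diagonal" condition on a
> > contingency table m is to check that every row and every column has exactly
> > one non-zero entry, though that only gives a yes/no answer:
> >
> >     m = as.matrix(sparseMatrix(i=k1$cluster, j=k2$cluster, x=1))
> >     all(rowSums(m > 0) == 1) && all(colSums(m > 0) == 1)   # TRUE only for a permuted diagonal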
> >
> > My first inclination would be to look at mutual-information-based measures
> > like the multinomial log-likelihood ratio (AKA LLR in Mahout land).
> >
> > This leads to these kinds of numbers:
> >
> > > mi
> > function(m) {entropy(m)-entropy(colSums(m))-entropy(rowSums(m))}
> > > entropy
> > function(m) {m = m/sum(m);sum(m * log (m+(m==0)))}
> > > mi(sparseMatrix(i=k6$cluster, j=k1$cluster, x=1))
> > [1] 2.025326
> > > mi(sparseMatrix(i=k2$cluster, j=k1$cluster, x=1))
> > [1] 2.163956
> > > mi(sparseMatrix(i=k1$cluster, j=k1$cluster, x=1))
> > [1] 2.302585
> > > mi(sparseMatrix(i=k1$cluster, j=k6$cluster, x=1))
> > [1] 2.025326
> > > mi(sparseMatrix(i=k6$cluster, j=k6$cluster, x=1))
> > [1] 2.163619
> > > mi(matrix(rmultinom(1, 10000, prob=rep(0.01, 100)), nrow=10))
> > [1] 0.005527828
> > >
> >
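> > For reference, the multinomial LLR (the G-test statistic) is, with natural
> > logs, just 2*N times this mutual information, so a sketch of it in the same
> > style would be:
> >
> >     llr = function(m) {2 * sum(m) * mi(m)}
> >     llr(sparseMatrix(i=k2$cluster, j=k1$cluster, x=1))   # = 2 * 10000 * 2.163956
> >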
> > The characteristics here are:
> >
> > Better clusterings produce higher scores.  Equivalent clusterings that have
> > different labels produce identical scores, whether compared to each other or
> > to themselves.  Nearly identical clusterings that have small problems produce
> > slightly lower scores.  Unrelated assignments produce near-zero values.
> >
> >
> >
> > On Fri, Apr 5, 2013 at 4:47 AM, Dan Filimon <[email protected]> wrote:
> >
> > > As part of evaluating cluster quality, I'd like to implement a bunch of
> > > quality measures, especially external ones.
> > >
> > > The one that I think would be particularly useful is the Adjusted Rand
> > > Index [1].
> > > Using a contingency table of the partitions from two clusterings, it returns
> > > a value of at most 1 (higher is better; 0 is the expected value for random
> > > partitions) that measures the similarity of the partitions.
> > >
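> > > For concreteness, here is a rough sketch in R of how the ARI falls out of
> > > the contingency table (the formula is the one at [1]; m is the table, with
> > > rowSums and colSums as the marginal counts):
> > >
> > >     ari = function(m) {
> > >       comb2 = function(x) {x * (x - 1) / 2}      # "n choose 2", vectorized
> > >       pairs.both = sum(comb2(m))                 # pairs grouped together in both clusterings
> > >       pairs.rows = sum(comb2(rowSums(m)))
> > >       pairs.cols = sum(comb2(colSums(m)))
> > >       expected = pairs.rows * pairs.cols / comb2(sum(m))
> > >       (pairs.both - expected) / ((pairs.rows + pairs.cols) / 2 - expected)
> > >     }
> > >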
> > > First of all, I'd like to know your thoughts on using ARI as a metric.
> > >
> > > Second, there's an implementation of ConfusionMatrix that is NxN. I'd like
> > > to extend this class to support unlabeled partitions of different sizes and
> > > add a method that computes the ARI.
> > >
> > > What are your thoughts?
> > >
> > > [1] http://en.wikipedia.org/wiki/Rand_index
> > >
> >
>
