Ted,
On Jun 17, 2009, at 2:51 AM, Ted Dunning wrote:
A principled approach to cluster evaluation is to measure how well the
cluster membership captures the structure of unseen data. A natural
measure for this is how much of the entropy of the data is captured by
cluster membership. For k-means and its natural L_2 metric, the natural
cluster quality metric is the squared distance from the nearest centroid
adjusted by the log_2 of the number of clusters. This can be compared to
the squared magnitude of the original data or the squared deviation from
the centroid for all of the data. The idea is that you are changing the
representation of the data by allocating some of the bits in your
original representation to represent which cluster each point is in. If
those bits aren't made up by the residue being small, then your
clustering is making a bad trade-off.
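
One possible reading of the bits trade-off above, as a rough sketch rather than the exact metric described: it assumes a Gaussian code, under which shrinking the mean squared residual by a factor r saves about (d/2) * log_2(r) bits per point in d dimensions, while naming the cluster costs log_2(k) bits. The function names (`mean_residual`, `bits_tradeoff`) and the Gaussian assumption are illustrative and not from the original message.

```python
import math


def squared_distance(a, b):
    """Squared L_2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def mean_residual(points, centroids):
    """Mean squared distance from each point to its nearest centroid."""
    total = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return total / len(points)


def log2_ratio(num, den):
    """log2(num / den) with explicit handling of zero residuals."""
    if num == 0 and den == 0:
        return 0.0
    if den == 0:
        return math.inf
    if num == 0:
        return -math.inf
    return math.log2(num / den)


def bits_tradeoff(train, held_out, centroids):
    """Return (bits_saved, bits_spent) per held-out point.

    bits_spent: log2(k), the cost of naming the cluster.
    bits_saved: ASSUMPTION (Gaussian code): 0.5 * d * log2(baseline / residual),
    where baseline is the mean squared deviation from the global training
    centroid and residual is the mean squared distance to the nearest centroid.
    """
    dims = len(held_out[0])
    baseline = mean_residual(held_out, [mean_point(train)])
    residual = mean_residual(held_out, centroids)
    bits_saved = 0.5 * dims * log2_ratio(baseline, residual)
    bits_spent = math.log2(len(centroids))
    return bits_saved, bits_spent
```

Under these assumptions, a clustering is making a bad trade-off on held-out data when `bits_saved` does not exceed `bits_spent`.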
In the past, I have used other more heuristic measures as well. One of
the key characteristics that I would like to see out of a clustering is
a degree of stability. Thus, I look at the fractions of points that are
assigned to each cluster or the distribution of distances from the
cluster centroid. These values should be relatively stable when applied
to held-out data.
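
The stability check on assignment fractions could be sketched roughly as follows. The use of total variation distance to compare the training and held-out fractions is an assumption made for illustration (the message does not name a comparison), and the function names are hypothetical.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared L_2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def assignment_fractions(points, centroids):
    """Fraction of points assigned to each cluster, in centroid order."""
    counts = Counter(nearest(p, centroids) for p in points)
    return [counts[i] / len(points) for i in range(len(centroids))]


def total_variation(p, q):
    """ASSUMPTION: total variation distance as the stability score.

    0.0 means identical fractions; 1.0 means completely disjoint.
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))
```

A small value of `total_variation(assignment_fractions(train, centroids), assignment_fractions(held_out, centroids))` would suggest the cluster sizes are stable; what counts as "small" depends on the sample sizes and is left open here.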
For text, you can actually compute perplexity, which measures how well
cluster membership predicts what words are used. This is nice because
you don't have to worry about the entropy of real-valued numbers.
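
As an illustration of the perplexity idea, here is a minimal sketch that assumes one unigram word model per cluster with add-alpha smoothing. The model choice, the smoothing, and the function names are assumptions for the example, not something stated in the message.

```python
import math
from collections import Counter, defaultdict


def train_cluster_unigrams(docs, labels):
    """Count words per cluster; docs are token lists, labels are cluster ids."""
    counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        counts[label].update(tokens)
        vocab.update(tokens)
    return counts, vocab


def perplexity(docs, labels, counts, vocab, alpha=1.0):
    """Held-out perplexity of words given each document's cluster.

    ASSUMPTION: add-alpha smoothing over the training vocabulary plus one
    extra slot shared by all unseen words.
    """
    slots = len(vocab) + 1
    log_sum = 0.0
    n_words = 0
    for tokens, label in zip(docs, labels):
        cluster_counts = counts[label]
        total = sum(cluster_counts.values())
        for word in tokens:
            p = (cluster_counts[word] + alpha) / (total + alpha * slots)
            log_sum += math.log2(p)
            n_words += 1
    return 2 ** (-log_sum / n_words)
```

Comparing this value against the perplexity obtained when every document is given the same label indicates how much cluster membership helps predict the words; lower is better.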
Do you have any references on any of the above approaches?
Thanks,
Grant