What I meant was that the distribution of distances for the members of a cluster in the training data should be similar to the distribution of the distances for the members of the cluster in the test data. The distribution of the number of members in each cluster should also be similar for training and test.
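To make this concrete, here is a minimal Python sketch of the two checks described in this thread: an unnormalized Kolmogorov-Smirnov statistic for comparing distance distributions, and binomial mutual information (which equals LLR divided by twice the number of samples) for comparing cluster sizes. The function names and example counts are mine, not from any Mahout code:

```python
import math
from bisect import bisect_right

def ks_unnormalized(xs, ys):
    """Maximum gap between two empirical CDFs. Deliberately NOT scaled by
    sample size, so the statistic measures raw similarity of the two
    distributions rather than significance of a difference."""
    xs, ys = sorted(xs), sorted(ys)
    gaps = []
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)   # fraction of xs <= v
        fy = bisect_right(ys, v) / len(ys)   # fraction of ys <= v
        gaps.append(abs(fx - fy))
    return max(gaps)

def llr_2x2(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 table of counts."""
    n = k11 + k12 + k21 + k22
    def sum_klogk(ks):
        return sum(k * math.log(k) for k in ks if k > 0)
    return 2 * (sum_klogk([k11, k12, k21, k22])
                - sum_klogk([k11 + k12, k21 + k22])   # row sums
                - sum_klogk([k11 + k21, k12 + k22])   # column sums
                + n * math.log(n))

def binomial_mutual_information(k11, k12, k21, k22):
    """MI (in nats) of a 2x2 table: members vs non-members of one cluster,
    crossed with training vs test.  Equal to LLR / (2 * N)."""
    n = k11 + k12 + k21 + k22
    return llr_2x2(k11, k12, k21, k22) / (2 * n)

# Hypothetical example: a cluster holds 40 of 200 training points and
# 35 of 180 test points.  A near-zero MI says the cluster's share of the
# data is stable between training and test.
mi = binomial_mutual_information(40, 160, 35, 145)
```

A degenerate single-member cluster shows up immediately in this framework: its training distance distribution is a point mass at zero, so the unnormalized KS gap against any test distribution will be large.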
This is not a panacea for evaluating clustering, but it will definitely highlight cases where the original clustering had too many clusters and thus was over-fitting the data. For instance, if a cluster has a single member, the centroid will sit right on top of that point and the distance will be zero. In the test data it is very unlikely that any point will land in exactly that same place, so you will get a different value.

Typically what you want is not so much a traditional reject-the-null kind of test (like the Kolmogorov-Smirnov test, for example), but rather a test of whether the test and training data definitively show similar distributions (i.e. grouping not-similar and can't-tell together into the reject pile). One handy way to do this is to use the Kolmogorov-Smirnov statistic, but without normalizing for sample size.

For testing the number of members in a cluster, you can use LLR, but I would tend to prefer computing binomial mutual information on each cluster, comparing members of the cluster versus non-members, and training versus test. This gives you the same affirmative property. Mutual information is the same as LLR divided by twice the number of samples.

Again, these tests will help you avoid over-fitting, but they won't help you determine whether one cluster or two is better (if you have lots of data, two should always be better). One heuristic is to use the largest number of clusters such that none fail the over-fitting test.

On Sat, Jan 9, 2010 at 9:18 AM, Grant Ingersoll <[email protected]> wrote:

>> This can be compared to the squared magnitude of the original data
>> or the squared deviation from the centroid for all of the data.
>
> Not sure I am following these two ideas enough to code it up.
>
> For magnitude, do you mean compare the above against the magnitude of
> each vector or do you mean something else?

--
Ted Dunning, CTO
DeepDyve
