What I meant was that the distribution of distances for the members of a cluster in the training data should be similar to the distribution of the distances for the members of the cluster in the test data. The distribution of the number of members in each cluster should also be similar for training and test.
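To make this concrete, here is a minimal Python sketch of the two checks described in this thread: an unnormalized Kolmogorov-Smirnov statistic for comparing distance distributions, and binomial mutual information (which equals LLR divided by twice the number of samples) for comparing cluster sizes. The function names and example counts are mine, not from any Mahout code:

```python
import math
from bisect import bisect_right

def ks_unnormalized(xs, ys):
    """Maximum gap between two empirical CDFs. Deliberately NOT scaled by
    sample size, so the statistic measures raw similarity of the two
    distributions rather than significance of a difference."""
    xs, ys = sorted(xs), sorted(ys)
    gaps = []
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)   # fraction of xs <= v
        fy = bisect_right(ys, v) / len(ys)   # fraction of ys <= v
        gaps.append(abs(fx - fy))
    return max(gaps)

def llr_2x2(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 table of counts."""
    n = k11 + k12 + k21 + k22
    def sum_klogk(ks):
        return sum(k * math.log(k) for k in ks if k > 0)
    return 2 * (sum_klogk([k11, k12, k21, k22])
                - sum_klogk([k11 + k12, k21 + k22])   # row sums
                - sum_klogk([k11 + k21, k12 + k22])   # column sums
                + n * math.log(n))

def binomial_mutual_information(k11, k12, k21, k22):
    """MI (in nats) of a 2x2 table: members vs non-members of one cluster,
    crossed with training vs test.  Equal to LLR / (2 * N)."""
    n = k11 + k12 + k21 + k22
    return llr_2x2(k11, k12, k21, k22) / (2 * n)

# Hypothetical example: a cluster holds 40 of 200 training points and
# 35 of 180 test points.  A near-zero MI says the cluster's share of the
# data is stable between training and test.
mi = binomial_mutual_information(40, 160, 35, 145)
```

A degenerate single-member cluster shows up immediately in this framework: its training distance distribution is a point mass at zero, so the unnormalized KS gap against any test distribution will be large.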
This is not a panacea for evaluating clustering, but it will definitely highlight cases where the original clustering had too many clusters and thus was over-fitting the data. For instance, if a cluster has a single member, the centroid will sit right on top of that point and the distance will be zero. In the test data it is very unlikely that any point will land in exactly that same place, so you will get a different value.

Typically what you want is not so much a traditional reject-the-null kind of test (like the Kolmogorov-Smirnov test, for example), but rather a test of whether the test and training data definitively show similar distributions (i.e. grouping not-similar and can't-tell together into the reject pile). One handy way to do this is to use the Kolmogorov-Smirnov statistic, but without normalizing for sample size.

For testing the number of members in a cluster, you can use LLR, but I would tend to prefer computing binomial mutual information on each cluster, comparing members of the cluster versus non-members, and training versus test. This gives you the same affirmative property. Mutual information is the same as LLR divided by twice the number of samples.

Again, these tests will help you avoid over-fitting, but they won't help you determine whether one cluster or two is better (if you have lots of data, two should always be better). One heuristic is to use the largest number of clusters such that none fail the over-fitting test.

On Sat, Jan 9, 2010 at 9:18 AM, Grant Ingersoll <[email protected]> wrote:

>> This can be compared to the squared magnitude of the original data
>> or the squared deviation from the centroid for all of the data.
>
> Not sure I am following these two ideas enough to code it up.
>
> For magnitude, do you mean compare the above against the magnitude of
> each vector or do you mean something else?

--
Ted Dunning, CTO
DeepDyve
