I didn't have anything to do with the code originally, so I can only comment in generalities.
Degenerate clusters with radius zero are a common problem in evaluation metrics. Even if the cluster itself isn't exactly degenerate, a sample drawn from it may be, and then you have the same problem. They are also a problem for maximum likelihood methods, because those methods cluster so as to maximize a metric that breaks under degeneracy. Sadly, a single point is the prototypical degenerate cluster, so it is easy to run into trouble.

K-means avoids this by avoiding the concept of radius (i.e. fixing it in such a way that it doesn't matter). Dirichlet mixtures handle it with a good prior. The CDbw metrics don't seem to handle this well.

My tendency would be to impose some kind of prior in the computation of radii (implicit in the max-min that you mention). How to do this well isn't clear to me without spending more than my allowance on looking at the code or the paper.

Sorry to be fragmentary. Hope it helps anyway.

On Tue, Sep 28, 2010 at 10:22 AM, Jeff Eastman <[email protected]> wrote:

> Sean, Robin, Ted: One of you guys evidently wrote the inter-cluster density
> computation but did not include an intra-cluster computation in "Mahout In
> Action". The CDbwEvaluator calculates both using only the representative
> points (and may have been transcribed incorrectly from the paper to boot).
> Please chime in.
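
For what it's worth, the kind of prior on the radius mentioned above could look roughly like the sketch below. This is illustrative Python only, not anything taken from Mahout or from the CDbw paper. The function name `smoothed_radius` and the parameters `prior_radius` and `prior_weight` are made up for the example; the idea is simply to treat the prior as a few pseudo-observations sitting at a fixed distance from the center, so that a single-point cluster never reports a radius of zero.

```python
import math


def smoothed_radius(points, center, prior_radius, prior_weight=1.0):
    """RMS distance of points from center, shrunk toward prior_radius.

    Hypothetical helper: prior_weight behaves like a count of
    pseudo-observations located at distance prior_radius from the center.
    """
    # Sum of squared Euclidean distances from each point to the center.
    squared = sum(math.dist(p, center) ** 2 for p in points)
    n = len(points)
    # Blend the observed squared distances with the prior's contribution.
    return math.sqrt(
        (squared + prior_weight * prior_radius ** 2) / (n + prior_weight)
    )


# A single-point cluster has a raw radius of zero, but the smoothed
# radius stays strictly positive.
single = smoothed_radius([(1.0, 2.0)], (1.0, 2.0), prior_radius=0.5)
```

With many points the observed distances dominate and the prior fades out; with one point the prior dominates. How to choose `prior_radius` sensibly (for instance from the overall spread of the data) is the part that would need a closer look at the code or the paper.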
