I didn't have anything to do with the code originally, so I can only comment
in generalities.

Degenerate clusters with radius zero are a common problem for evaluation
metrics.  Even if the cluster itself isn't exactly degenerate, a sample drawn
from it may be, and then you have the same problem.  Degeneracy is also a
problem for maximum likelihood methods, because they cluster to maximize a
metric that breaks under degeneracy.  Sadly, a single point is the
prototypical degenerate cluster, so it is easy for trouble to break out.
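
To make the maximum likelihood failure concrete, here is a minimal sketch,
assuming a one-dimensional Gaussian per cluster.  The function name is made up
for illustration and nothing here is taken from the Mahout code.  For a
cluster that holds a single point, the squared error is zero, so shrinking the
radius (sigma) only ever raises the likelihood.

```python
import math

def gaussian_log_likelihood(points, mean, sigma):
    """Log-likelihood of 1-D points under a Gaussian N(mean, sigma^2)."""
    n = len(points)
    squared_error = sum((x - mean) ** 2 for x in points)
    return (-n * math.log(sigma * math.sqrt(2 * math.pi))
            - squared_error / (2 * sigma ** 2))

# A single-point cluster centered on its own point: the squared error is
# zero, so the log-likelihood grows without bound as sigma shrinks.
single = [3.0]
for sigma in (1.0, 1e-2, 1e-4, 1e-8):
    print(sigma, gaussian_log_likelihood(single, 3.0, sigma))
```

With sigma held fixed, the same quantity stays bounded, which is roughly why
fixing the radius sidesteps the problem.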

K-means avoids this by avoiding the concept of radius (i.e. fixing it in a
way that it doesn't matter).  Dirichlet mixtures handle it with a good prior.

The CDbw metrics don't seem to handle this well.  My tendency would be to
impose some kind of prior in the computation of radii (implicit in the
max-min that you mention).  How to do this well isn't clear to me without
spending more than my allowance of time looking at the code or the paper.
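
One possible reading of "a prior in the computation of radii", sketched under
assumptions: treat the prior as a few pseudo-observations at a nominal
distance, so the estimated radius can never collapse to zero.  This is not
what CDbwEvaluator does and not what the paper prescribes; smoothed_radius,
prior_radius and prior_weight are hypothetical names and values.

```python
import math

def smoothed_radius(points, center, prior_radius=1.0, prior_weight=1.0):
    """RMS distance from center, shrunk toward prior_radius.

    prior_radius and prior_weight are illustrative assumptions: the prior
    acts like prior_weight extra observations at distance prior_radius.
    """
    squared = sum(math.dist(p, center) ** 2 for p in points)
    return math.sqrt((squared + prior_weight * prior_radius ** 2)
                     / (len(points) + prior_weight))

# A single-point cluster no longer has radius zero.
print(smoothed_radius([(0.0, 0.0)], (0.0, 0.0)))

# With many points the prior's influence fades and the data dominates.
ring = [(2.0, 0.0)] * 100
print(smoothed_radius(ring, (0.0, 0.0)))
```

The prior scale would have to be chosen relative to the data, for example
from the overall spread of all points, which is the part I can't say how to
do well.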

Sorry to be fragmentary.  Hope it helps anyway.



On Tue, Sep 28, 2010 at 10:22 AM, Jeff Eastman
<[email protected]> wrote:

> Sean, Robin, Ted: One of you guys evidently wrote the inter-cluster density
> computation but did not include an intra-cluster computation in "Mahout In
> Action". The CDbwEvaluator calculates both using only the representative
> points (and may have been transcribed incorrectly from the paper to boot).
> Please chime in.
