I opened https://issues.apache.org/jira/browse/MAHOUT-236 for this.

Some questions inline.

On Jun 17, 2009, at 2:51 AM, Ted Dunning wrote:

> A principled approach to cluster evaluation is to measure how well the
> cluster membership captures the structure of unseen data.  A natural measure
> for this is to measure how much of the entropy of the data is captured by
> cluster membership.  For k-means and its natural L_2 metric, the natural
> cluster quality metric is the squared distance from the nearest centroid
> adjusted by the log_2 of the number of clusters.

Makes sense.
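To make sure I'm reading that right, here is a rough sketch of what I think
the per-point cost works out to, using plain double[] arrays rather than
Mahout Vectors just to show the arithmetic (heldOut and centroids are
placeholder names):

// Rough sketch of the held-out cost as I understand it: squared distance
// to the nearest centroid, plus log_2(k) "bits" spent recording which of
// the k clusters the point landed in.
public class HeldOutCost {

  static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Average cost per held-out point: min squared distance + log_2(k). */
  static double averageCost(double[][] heldOut, double[][] centroids) {
    double logK = Math.log(centroids.length) / Math.log(2.0);
    double total = 0.0;
    for (double[] point : heldOut) {
      double nearest = Double.POSITIVE_INFINITY;
      for (double[] centroid : centroids) {
        nearest = Math.min(nearest, squaredDistance(point, centroid));
      }
      total += nearest + logK;
    }
    return total / heldOut.length;
  }
}

Is that roughly it, or does the log_2(k) term get weighted differently?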

>  This can be compared to
> the squared magnitude of the original data
> or the squared deviation from the
> centroid for all of the data.  

Not sure I'm following these two ideas well enough to code them up.

For magnitude, do you mean comparing the above against the squared magnitude
of each vector, or something else?

And for the second part, do you mean calculate the centroid for all the data
and then compare the squared deviation of the vector under scrutiny to that
centroid?

Code and/or math here would be helpful.
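My best guess at the two comparisons, in the same style as the sketch above
(plain arrays and placeholder names again; please correct me if this isn't
what you meant), is:

// Baseline (a): average squared magnitude of the held-out vectors, i.e.
// what it "costs" to encode them with no model at all.
static double averageSquaredMagnitude(double[][] data) {
  double total = 0.0;
  for (double[] point : data) {
    for (double v : point) {
      total += v * v;
    }
  }
  return total / data.length;
}

// Baseline (b): average squared deviation from the single centroid of
// all the data, i.e. what the k = 1 clustering would cost.
static double averageSquaredDeviationFromMean(double[][] data) {
  int dim = data[0].length;
  double[] mean = new double[dim];
  for (double[] point : data) {
    for (int i = 0; i < dim; i++) {
      mean[i] += point[i] / data.length;
    }
  }
  double total = 0.0;
  for (double[] point : data) {
    for (int i = 0; i < dim; i++) {
      double d = point[i] - mean[i];
      total += d * d;
    }
  }
  return total / data.length;
}

If that's right, then a clustering is only pulling its weight when the
adjusted cost above comes in clearly below both of these baselines?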

Also, what about other clustering algorithms besides k-means and the L_2 metric?

> The idea is that you are changing the
> representation of the data by allocating some of the bits in your original
> representation to represent which cluster each point is in.  If those bits
> aren't made up by the residue being small then your clustering is making a
> bad trade-off.
> 
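So with, say, 16 clusters we'd be spending log_2(16) = 4 bits per point just
to record which cluster each point landed in, and the clustering only pays
for itself if encoding the residuals (the squared distances above) gets at
least that much cheaper than encoding the raw points.  Is that the right
reading?
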
> In the past, I have used other more heuristic measures as well.  One of the
> key characteristics that I would like to see out of a clustering is a degree
> of stability.  Thus, I look at the fractions of points that are assigned to
> each cluster

So if we just spit out the percentage of points assigned to each cluster, then
someone could re-run with the original data plus the held-out data, and those
percentages should still be about the same, assuming the held-out data is
drawn from the same distribution as the original data.
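Concretely, I'm picturing something like this (same plain-array style as the
earlier sketches), run once on the original data and once on original plus
held-out data:

// Fraction of points assigned to each cluster.  The claim to check is that
// this histogram stays roughly the same whether it is computed on the
// original data or on original + held-out data.
static double[] clusterFractions(double[][] data, double[][] centroids) {
  int[] counts = new int[centroids.length];
  for (double[] point : data) {
    int best = 0;
    double bestDist = Double.POSITIVE_INFINITY;
    for (int c = 0; c < centroids.length; c++) {
      double dist = 0.0;
      for (int i = 0; i < point.length; i++) {
        double d = point[i] - centroids[c][i];
        dist += d * d;
      }
      if (dist < bestDist) {
        bestDist = dist;
        best = c;
      }
    }
    counts[best]++;
  }
  double[] fractions = new double[centroids.length];
  for (int c = 0; c < centroids.length; c++) {
    fractions[c] = (double) counts[c] / data.length;
  }
  return fractions;
}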

> or the distribution of distances from the cluster centroid.

Likewise, we'd expect the distribution of distances for the held-out points to
look about the same as for the original points, right?
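For the distances, maybe collect each point's distance to its nearest centroid
and compare summary statistics or a histogram across the two runs; a sketch:

// Sorted distances from each point to its nearest centroid.  Comparing,
// say, the quartiles of this list for the original vs. held-out data is a
// crude check that the distance distribution is stable.
static double[] distancesToNearestCentroid(double[][] data, double[][] centroids) {
  double[] distances = new double[data.length];
  for (int p = 0; p < data.length; p++) {
    double nearest = Double.POSITIVE_INFINITY;
    for (double[] centroid : centroids) {
      double dist = 0.0;
      for (int i = 0; i < centroid.length; i++) {
        double d = data[p][i] - centroid[i];
        dist += d * d;
      }
      nearest = Math.min(nearest, dist);
    }
    distances[p] = Math.sqrt(nearest);
  }
  java.util.Arrays.sort(distances);
  return distances;
}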

> These values should be relatively stable when applied to held-out data.
> 
> For text, you can actually compute perplexity which measures how well
> cluster membership predicts what words are used.  This is nice because you
> don't have to worry about the entropy of real valued numbers.

Do you have a good ref. on perplexity and/or some R code (or other)?
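In case it helps, my fuzzy understanding is that perplexity is 2 raised to the
average number of bits per word needed to encode the held-out documents under
a smoothed per-cluster unigram word distribution, so something along these
lines (all the array names here are hypothetical):

// Perplexity of held-out word counts under per-cluster unigram models:
// 2 ^ (average bits per word).  wordProbs[c][w] is the smoothed probability
// of word w in cluster c (must be > 0), heldOutCounts[d][w] is the count of
// word w in held-out doc d, and clusterOfDoc[d] is that doc's cluster.
static double perplexity(double[][] wordProbs, int[][] heldOutCounts, int[] clusterOfDoc) {
  double totalBits = 0.0;
  long totalWords = 0;
  for (int d = 0; d < heldOutCounts.length; d++) {
    double[] probs = wordProbs[clusterOfDoc[d]];
    for (int w = 0; w < heldOutCounts[d].length; w++) {
      int count = heldOutCounts[d][w];
      if (count > 0) {
        totalBits += -count * (Math.log(probs[w]) / Math.log(2.0));
        totalWords += count;
      }
    }
  }
  return Math.pow(2.0, totalBits / totalWords);
}

Lower perplexity would mean cluster membership really does predict word usage,
but a pointer to a proper reference would still be great.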

Thanks,
Grant
