Brown and Della Pietra's algorithm for clustering based on entropy is somewhat infamous for how hard it is to get usable performance out of it. Mike Collins was responsible for a famously speedy version. Having built one that is just barely fast enough in C++, I wouldn't recommend trying it in Java. Of course, you aren't proposing that; you're just recommending the bigram entropy metric or something like it.
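For reference, the quantity that metric boils down to is the average mutual information between adjacent class bigrams. A toy, self-contained version might look like the following (illustrative only, not Mahout code; the tab-separated key format and the class name are arbitrary choices for the example):

import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch: average mutual information (AMI) of adjacent class
 * bigrams, the quantity a Brown-style clustering tries to maximize.
 * Keys in bigramCounts are "leftClass\trightClass".
 */
public final class ClassBigramAmi {

  public static double averageMutualInformation(Map<String, Double> bigramCounts) {
    double total = 0.0;
    Map<String, Double> leftMarginal = new HashMap<String, Double>();
    Map<String, Double> rightMarginal = new HashMap<String, Double>();

    // Accumulate total mass and the left/right class marginals.
    for (Map.Entry<String, Double> e : bigramCounts.entrySet()) {
      String[] classes = e.getKey().split("\t");
      double count = e.getValue();
      total += count;
      leftMarginal.put(classes[0], count + getOrZero(leftMarginal, classes[0]));
      rightMarginal.put(classes[1], count + getOrZero(rightMarginal, classes[1]));
    }

    // AMI = sum over class pairs of p(c1,c2) * log2( p(c1,c2) / (p(c1) * p(c2)) ).
    double ami = 0.0;
    for (Map.Entry<String, Double> e : bigramCounts.entrySet()) {
      String[] classes = e.getKey().split("\t");
      double pJoint = e.getValue() / total;
      double pLeft = getOrZero(leftMarginal, classes[0]) / total;
      double pRight = getOrZero(rightMarginal, classes[1]) / total;
      ami += pJoint * Math.log(pJoint / (pLeft * pRight)) / Math.log(2.0);
    }
    return ami;
  }

  private static double getOrZero(Map<String, Double> m, String key) {
    Double v = m.get(key);
    return v == null ? 0.0 : v;
  }
}

Evaluating this for a fixed clustering is cheap; it's the greedy merge search over clusterings that makes the full algorithm so slow.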
On Mon, Jul 27, 2009 at 9:42 PM, Ted Dunning <[email protected]> wrote:
> (vastly delayed response ... huge distractions competing with more than 2
> minutes answers are to blame)
>
> Grant,
>
> For evaluating clustering for symbol sequences:
>
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.7275
>
> Most of the other references I have found talk about quality relative to
> gold standard judgments about whether exemplars are in the same class or
> relative to similarity/distinctiveness ratios. Neither is all that
> satisfactory.
>
> My preference is an entropic measure that describes how much of the
> information in your data is captured by the clustering vs how much residual
> info there is.
>
> The other reference I am looking for may be in David Mackay's book. The
> idea is that you measure the quality of the approximation by looking at the
> entropy in the cluster assignment relative to the residual required to
> precisely specify the original data relative to the quantized value.
>
> This is also related to trading off signal/noise in a vector quantizer.
>
> David, do you have a moment to talk about this with me? I can't free up
> the time to chase these final references and come up with a nice formula for
> this. I think you could do it in 10-20 minutes.
>
> On Tue, Jul 14, 2009 at 6:41 AM, Grant Ingersoll <[email protected]> wrote:
>
>> On Jun 17, 2009, at 2:51 AM, Ted Dunning wrote:
>>
>>> A principled approach to cluster evaluation is to measure how well the
>>> cluster membership captures the structure of unseen data. A natural
>>> measure for this is to measure how much of the entropy of the data is
>>> captured by cluster membership. For k-means and its natural L_2 metric,
>>> the natural cluster quality metric is the squared distance from the
>>> nearest centroid adjusted by the log_2 of the number of clusters. This
>>> can be compared to the squared magnitude of the original data or the
>>> squared deviation from the centroid for all of the data. The idea is
>>> that you are changing the representation of the data by allocating some
>>> of the bits in your original representation to represent which cluster
>>> each point is in. If those bits aren't made up by the residue being
>>> small then your clustering is making a bad trade-off.
>>>
>>> In the past, I have used other more heuristic measures as well. One of
>>> the key characteristics that I would like to see out of a clustering is
>>> a degree of stability. Thus, I look at the fractions of points that are
>>> assigned to each cluster or the distribution of distances from the
>>> cluster centroid. These values should be relatively stable when applied
>>> to held-out data.
>>>
>>> For text, you can actually compute perplexity which measures how well
>>> cluster membership predicts what words are used. This is nice because
>>> you don't have to worry about the entropy of real valued numbers.
>>
>> Do you have any references on any of the above approaches?
>
> --
> Ted Dunning, CTO
> DeepDyve
>
> 111 West Evelyn Ave. Ste. 202
> Sunnyvale, CA 94086
> http://www.deepdyve.com
> 858-414-0013 (m)
> 408-773-0220 (fax)
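Reading the k-means version of that trade-off literally, a toy implementation might look like the following (nothing Mahout-specific; the class and method names are just for illustration): charge log_2(k) bits per point for naming a cluster, add the squared residual to the nearest centroid, and compare against the squared deviation from the global mean. The exact exchange rate between assignment bits and residual depends on the coding model Ted alludes to, so this is a sketch rather than the formula he's after.

/**
 * Sketch of the entropy-style trade-off described in the quoted mail:
 * bits spent on cluster assignment plus the residual squared distance,
 * versus the squared deviation of the data from its global centroid.
 */
public final class ClusterEntropyTradeoff {

  /** Squared L_2 distance between two points. */
  public static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Residual: squared distance from each point to its nearest centroid, summed. */
  public static double residual(double[][] points, double[][] centroids) {
    double total = 0.0;
    for (double[] p : points) {
      double best = Double.MAX_VALUE;
      for (double[] c : centroids) {
        best = Math.min(best, squaredDistance(p, c));
      }
      total += best;
    }
    return total;
  }

  /** Baseline: squared deviation of the data from its global centroid. */
  public static double totalDeviation(double[][] points) {
    int dim = points[0].length;
    double[] mean = new double[dim];
    for (double[] p : points) {
      for (int i = 0; i < dim; i++) {
        mean[i] += p[i] / points.length;
      }
    }
    double total = 0.0;
    for (double[] p : points) {
      total += squaredDistance(p, mean);
    }
    return total;
  }

  /**
   * Per-point cost of the clustering: log2(k) bits for the assignment plus
   * the average squared residual.
   */
  public static double costPerPoint(double[][] points, double[][] centroids) {
    return Math.log(centroids.length) / Math.log(2.0)
        + residual(points, centroids) / points.length;
  }
}

As with the stability point in the quoted mail, the interesting check is whether costPerPoint stays roughly the same when you apply the same centroids to held-out data.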
