On Thu, Aug 6, 2009 at 2:57 AM, Ted Dunning <[email protected]> wrote:
> Generally, I find it better to use a more nuanced approach to try to get
> something like best common substring, or best over-represented sub-strings.
> A nice way to do that is the log-likelihood ratio test that I use for
> everything under the sun. This would consider in-cluster and out-of-cluster
> as two classes and would consider the frequency of each possible term or
> phrase in these two classes. This will give you words and phrases that are
> anomalously common in your cluster and relatively rare outside it. You may
> want to use document frequency for these comparisons since you can often get
> those frequencies from, for example, a Lucene index more easily than the
> actual number of occurrences.
Hi Ted,

I'm having a little difficulty understanding LLR for cluster labels. For a phrase, if

- in-cluster doc frequency is inDF
- out-of-cluster doc frequency is outDF
- size of the cluster is clusterSize
- size of the corpus is corpusSize

how do I calculate the LLR? I have difficulty mapping these numbers to the Event A and Event B that you talked about on your blog.

From the basic numbers, I could come up with an in-cluster percentage, but that doesn't help much. For example, say my cluster size is 2000 documents and the corpus size is 10000. A phrase which occurs in 5 documents in the cluster and doesn't appear outside the cluster has an in-cluster percentage of 100. Another phrase occurs in 1000 documents in the cluster and 1000 documents outside the cluster; this phrase has an in-cluster percentage of 50. Intuitively, the second is a better label candidate than the first, but I am unable to figure out how these numbers need to be normalized.

It would be great if you could shed some light on this.

Thanks,
--shashi
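For reference, one way to map those four numbers onto the 2x2 contingency table behind Dunning's LLR is sketched below. This mirrors the entropy-based formulation used in, e.g., Mahout's LogLikelihood class; the Python function names here are my own, not from any library:

```python
import math

def x_log_x(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table,
    computed as 2 * (row entropy + column entropy - matrix entropy)."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - mat_entropy)

def cluster_label_llr(in_df, out_df, cluster_size, corpus_size):
    """Score a candidate cluster label from document frequencies.

    The 2x2 table counts documents:
      k11: in cluster,  phrase present      -> in_df
      k12: out of cluster, phrase present   -> out_df
      k21: in cluster,  phrase absent       -> cluster_size - in_df
      k22: out of cluster, phrase absent    -> (corpus_size - cluster_size) - out_df
    """
    return llr(in_df,
               out_df,
               cluster_size - in_df,
               (corpus_size - cluster_size) - out_df)
```

With the example numbers above (cluster of 2000 in a corpus of 10000), `cluster_label_llr(1000, 1000, 2000, 10000)` comes out much larger than `cluster_label_llr(5, 0, 2000, 10000)`, matching the intuition that the second phrase, despite its lower in-cluster percentage, carries far more evidence of being over-represented in the cluster.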
