Ted,

Thank you for the detailed explanation. I think I just hit the classic "this is left as an exercise to the reader."

One last query (I hope). On your blog you have defined LLR as follows:

> LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))
> where H is Shannon's entropy, computed as the sum of
> (k_ij / sum(k)) log (k_ij / sum(k))

I am unable to follow this. Do you have any code to explain it? An elaboration of the following example would be equally great.

> Also suppose that the corpus has 100,000 documents in it. Then we have
> (k11, k12, k21, k22, llr) as
>
> 5, 1995, 100, 97900, 2.96

Thanks,
--shashi

On Mon, Aug 10, 2009 at 10:32 PM, Ted Dunning <[email protected]> wrote:

> On Mon, Aug 10, 2009 at 6:51 AM, Shashikant Kore <[email protected]> wrote:
>
>> I have a little difficulty in understanding LLR for cluster labels.
>
> Sorry about that. I will try to be more clear.
>
>> For a phrase, if
>> - in-cluster doc frequency is inDF
>> - out-of-cluster doc frequency is outDF
>> - size of the cluster is clusterSize
>> - size of the corpus is corpusSize
>
> Good.
>
>> how do I calculate the LLR?
>
> Assuming that the corpus is a superset of the cluster, form the table using:
>
> k11 = inDF
> k12 = clusterSize - inDF
> k21 = outDF
> k22 = corpusSize - clusterSize - outDF
>
> If the cluster is not a subset of the corpus, then k22 = corpusSize - outDF
>
>> I have difficulty in mapping these numbers to Event A & Event B that
>> you talked about on your blog.
>
> Event A is in-cluster, Event B is out-of-cluster.
>
>> From the basic numbers, I could come up with an inCluster percentage, but
>> that doesn't help much. For example, let's say my cluster size is
>> 2000 documents and the corpus size is 100,000. A phrase which occurs in the
>> cluster in 5 documents and doesn't appear outside the cluster has an
>> inCluster percentage of 100. Another phrase occurs 1000 times in
>> the cluster and 1000 times outside the cluster. This phrase has an
>> inCluster percentage of 50.
>> Intuitively, this is a better candidate for a label than the previous
>> one. But I am unable to figure out how these numbers need to be
>> normalized.
>
> First, the corpus size should normally be much larger than your cluster
> size. With document categorization, the ratio is enormous; with clustering
> it should still be at least one order of magnitude larger.
>
> So let's take your example and add a case where in-cluster = 5 and
> out-cluster = 5, another where in-cluster = 5 and out-cluster = 100, and
> another where in-cluster = 1000 and out-cluster = 45,000.
>
> Also suppose that the corpus has 100,000 documents in it. Then we have
> (k11, k12, k21, k22, llr) as
>
> 5, 1995, 0, 98000, 39.33
> 5, 1995, 5, 97995, 25.47
> 5, 1995, 100, 97900, 2.96
> 1000, 1000, 1000, 97000, 5714.93
> 1000, 1000, 45000, 48000, 2.04
>
> According to llr, your original case of 5 in and 0 out is definitely worthy
> of mention, and the case with 5 in and 5 out is somewhat less interesting.
> The case with 5 in and 100 out is not very interesting, nor is the case with
> 1000 in and 45000 out. Your case with 1000 in and 1000 out is the most
> exciting of them all.
>
> The most useful way to think of this is as the percentage of in-cluster
> documents that have the feature (term) versus the percentage out, keeping in
> mind that both percentages are uncertain since we have only a sample of all
> possible documents. Where these percentages are very different, and where
> that difference is unlikely to be due to accidental variation, LLR will
> be large.
>
> I don't know if I mentioned this on the blog, but it is often nice to
> rescale these scores by taking the square root and adding a sign according
> to whether k11/(k11+k12) > k21/(k21+k22). This gives you a number that has
> the same scale as a normal distribution, so lots more people will have good
> intuitions about what is large and what is not.
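Since Shashi asked for code, the blog formula can be turned into a short script directly. The sketch below is my own illustration in Python, not the Mahout implementation, and the function names are made up. One subtlety: H as written on the blog, sum of (k_ij / sum(k)) log (k_ij / sum(k)), is the negative of the usual Shannon entropy, but the signs cancel in the LLR formula as long as the same convention is used for the cell counts, row sums, and column sums (with the standard convention 0 log 0 = 0). It also includes the signed square-root rescaling Ted describes; the printed values should agree closely with the table in the thread, allowing for small rounding differences.

```python
import math

def h(counts):
    # Ted's H: sum of (k_ij / sum(k)) * log(k_ij / sum(k)), with 0 log 0 = 0.
    # Note this is the negative of the usual Shannon entropy; the sign
    # convention cancels out in llr() below as long as it is used consistently.
    total = sum(counts)
    return sum((k / total) * math.log(k / total) for k in counts if k > 0)

def llr(k11, k12, k21, k22):
    # LLR = 2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))
    k = [k11, k12, k21, k22]
    row_sums = [k11 + k12, k21 + k22]   # cluster size, rest of corpus
    col_sums = [k11 + k21, k12 + k22]   # docs with the term, docs without
    return 2 * sum(k) * (h(k) - h(row_sums) - h(col_sums))

def signed_root_llr(k11, k12, k21, k22):
    # Square root of LLR, signed by whether the in-cluster rate exceeds the
    # out-of-cluster rate, so the score sits on a normal-like scale.
    root = math.sqrt(llr(k11, k12, k21, k22))
    if k11 / (k11 + k12) < k21 / (k21 + k22):
        root = -root
    return root

# The five cases from the table above (corpus of 100,000 documents).
for row in [(5, 1995, 0, 98000),
            (5, 1995, 5, 97995),
            (5, 1995, 100, 97900),
            (1000, 1000, 1000, 97000),
            (1000, 1000, 45000, 48000)]:
    print(row, round(llr(*row), 2), round(signed_root_llr(*row), 2))
```

A quick sanity check on the formula: if the term rate is identical inside and outside the cluster (e.g. k = 10, 10, 10, 10), the LLR is exactly 0, and it grows as the in-cluster and out-of-cluster rates diverge.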
