Great! Solved the equations on paper and now I understand entropy and LLR.
Thanks, Ted.

--shashi

On Tue, Aug 11, 2009 at 12:11 AM, Ted Dunning <[email protected]> wrote:
> I have some R code handy. It should be easy to translate to Mahoutish Java.
>
> *entropy <- function(x) {-sum(x*log((x+(x==0))/sum(x)))}*
>
> The only tricky bit here is that sum will sum up a vector or a matrix, and (x==0) returns a matrix or vector of booleans that is the same shape as x. Adding a boolean means that all instances where x!=0 are left alone while all zeros are incremented by one. This causes the log to return zero, but we are (elementwise) multiplying by x anyway. Elements of x cannot be < 0 by definition (they are counts). The reason for all of that is so that 0 log 0 = 0, as desired. If sum(x) = 1, then this function will give you Shannon entropy.
>
> and then:
>
> *llr <- function(x) { 2*(entropy(rowSums(x)) + entropy(colSums(x)) - entropy(x))}*
>
> The functions rowSums and colSums do pretty much what you would expect. Given a matrix, they add up the elements in each row or column (to get a vector). I moved around some minus signs to make me happy, btw. All that matters in the end is that entropy and llr both give positive values.
>
> Some test cases:
>
> *> entropy(c(1,1))
> [1] 1.386294
>> llr(matrix(c(1,0,0,1), nrow=2))
> [1] 2.772589
>> llr(matrix(c(10,0,0,10), nrow=2))
> [1] 27.72589
>> llr(matrix(c(5,1995,0,100000), nrow=2))
> [1] 39.33052
>> llr(matrix(c(1000,1995,1000,100000), nrow=2))
> [1] 4730.737
>> llr(matrix(c(1000,1000,1000,100000), nrow=2))
> [1] 5734.343
>> llr(matrix(c(1000,1000,1000,99000), nrow=2))
> [1] 5714.932
> *
>
> On Mon, Aug 10, 2009 at 11:11 AM, Shashikant Kore <[email protected]> wrote:
>
>> Ted,
>>
>> Thank you for the elaborate explanation.
>>
>> I think I just hit the classic "this is left as an exercise to the reader." One last query (I hope).
>>
>> On your blog you have defined LLR as follows.
>>
>> > LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))
>> > where H is Shannon's entropy, computed as the sum of (k_ij / sum(k)) log (k_ij / sum(k))
>>
>> I am unable to follow this. Do you have any code to explain this? Or an elaboration of the following example would be equally great.
>>
>> > Also suppose that the corpus has 100,000 documents in it. We have (k11, k12, k21, k22, llr) as
>> >
>> > 5, 1995, 100, 97900, 2.96
>>
>> Thanks,
>>
>> --shashi
>>
>> On Mon, Aug 10, 2009 at 10:32 PM, Ted Dunning <[email protected]> wrote:
>> > On Mon, Aug 10, 2009 at 6:51 AM, Shashikant Kore <[email protected]> wrote:
>> >
>> >> I have a little difficulty in understanding LLR for cluster labels.
>> >
>> > Sorry about that. I will try to be more clear.
>> >
>> >> For a phrase, if
>> >> - in-cluster doc frequency is inDF
>> >> - out-of-cluster doc frequency is outDF
>> >> - size of the cluster is clusterSize
>> >> - size of the corpus is corpusSize
>> >
>> > Good.
>> >
>> >> how do I calculate the LLR?
>> >
>> > Assuming that the corpus is a superset of the cluster, form the table using:
>> >
>> > k11 = inDF
>> > k12 = clusterSize - inDF
>> > k21 = outDF
>> > k22 = corpusSize - clusterSize - outDF
>> >
>> > If the cluster is not a subset of the corpus, then k22 = corpusSize - outDF
>> >
>> >> I have difficulty in mapping these numbers to Event A & Event B that
>> >> you talked about on your blog.
>> >
>> > Event A is in-cluster, Event B is out-of-cluster.
>> >
>> >> From the basic numbers, I could come up with an inCluster percentage. But
>> >> that doesn't help much. For example, let's say my cluster size is
>> >> 2000 documents and corpus size is 1000. A phrase which occurs in the
>> >> cluster in 5 documents and doesn't appear outside the cluster has an
>> >> inCluster percentage of 100. Another phrase occurs 1000 times in
>> >> the cluster and 1000 times outside the cluster. This phrase has an
>> >> inCluster percentage of 50. Intuitively, this is a better candidate for
>> >> a label than the previous one. But I am unable to figure out how these
>> >> numbers need to be normalized.
>> >
>> > First, the corpus size should normally be much larger than your cluster size. With document categorization, the ratio is enormous; with clustering it should still be at least one order of magnitude larger.
>> >
>> > So let's take your example and add a case where in-cluster = 5 and out-cluster = 5, another where in-cluster = 5 and out-cluster = 100, and another where in-cluster = 1000 and out-cluster = 45,000.
>> >
>> > Also suppose that the corpus has 100,000 documents in it. We have (k11, k12, k21, k22, llr) as
>> >
>> > 5, 1995, 0, 98000, 39.33
>> > 5, 1995, 5, 97995, 25.47
>> > 5, 1995, 100, 97900, 2.96
>> > 1000, 1000, 1000, 97000, 5714.93
>> > 1000, 1000, 45000, 48000, 2.04
>> >
>> > According to llr, your original case of 5 in and 0 out is definitely worthy of mention, and the case with 5 in and 5 out is somewhat less interesting. The case with 5 in and 100 out is not very interesting, nor is the case with 1000 in and 45000 out. Your case with 1000 in and 1000 out is the most exciting of them all.
>> >
>> > The most useful way to think of this is as the percentage of in-cluster documents that have the feature (term) versus the percentage out, keeping in mind that both percentages are uncertain since we have only a sample of all possible documents. Where these percentages are very different, and where that difference is unlikely to be due to accidental variation, LLR will be large.
>> >
>> > I don't know if I mentioned this on the blog, but it is often nice to rescale these scores by taking the square root and adding a sign according to whether k11/(k11+k12) > k21/(k21+k22). This gives you a number that has the same scale as a normal distribution, so lots more people will have good intuitions about what is large and what is not.
>
> --
> Ted Dunning, CTO
> DeepDyve
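To make the recipe concrete, here is a short, self-contained R sketch that plugs the inDF/outDF/clusterSize/corpusSize quantities from the thread into Ted's entropy and llr functions, building the 2x2 table exactly as he describes. The particular numbers (5 in-cluster, 100 out-of-cluster, cluster of 2,000, corpus of 100,000) reproduce the "5, 1995, 100, 97900, 2.96" row of his table; the rest is only an illustration, not code from Mahout.

entropy <- function(x) {-sum(x*log((x+(x==0))/sum(x)))}
llr     <- function(x) {2*(entropy(rowSums(x)) + entropy(colSums(x)) - entropy(x))}

# Counts for one candidate label phrase (the 2.96 row above).
inDF        <- 5       # in-cluster doc frequency
outDF       <- 100     # out-of-cluster doc frequency
clusterSize <- 2000    # documents in the cluster
corpusSize  <- 100000  # documents in the corpus (a superset of the cluster)

k <- matrix(c(inDF,  clusterSize - inDF,                  # k11, k12
              outDF, corpusSize - clusterSize - outDF),   # k21, k22
            nrow = 2, byrow = TRUE)

llr(k)  # about 2.96, matching the table above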

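The signed rescaling Ted describes in his last paragraph can be layered on top of the same functions. The helper below (signedRootLLR is just a name chosen for this sketch, not something from the thread or from Mahout) takes the square root of the LLR and makes it positive when the phrase is relatively more frequent inside the cluster than outside, i.e. when k11/(k11+k12) > k21/(k21+k22).

# Signed square root of the LLR, on roughly the scale of a standard normal.
signedRootLLR <- function(k) {
  inRate  <- k[1, 1] / (k[1, 1] + k[1, 2])   # fraction of cluster docs with the phrase
  outRate <- k[2, 1] / (k[2, 1] + k[2, 2])   # fraction of other docs with the phrase
  s <- if (inRate > outRate) 1 else -1
  s * sqrt(llr(k))
}

signedRootLLR(k)  # about +1.72 for the 5-in / 100-out example above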