Thank you for the reply Ted. I see that they are equivalent now.

From Wikipedia:

    H(K) = -sum_i( (k_i/N) * log(k_i/N) )   where N = sum_i(k_i)
From the code in org.apache.mahout.math.stats.LogLikelihood.java:

    H(K) = -( sum_i( k_i * log(k_i) ) - N*log(N) )   where N = sum_i(k_i)

From your blog entry:

    H(K) = sum_i( (k_i/N) * log(k_i) ) - log(N)   where N = sum_i(k_i)

During the LLR calculation in LogLikelihood.java, the LLR is computed as

    2 * ( H(matrix) - H(rowSum) - H(colSum) )

whereas in your blog entry it is given as

    2 * N * ( H(matrix) - H(rowSums(k)) - H(colSums(k)) )

The extra N in the second calculation is required because it is a remnant of the way the entropy is calculated in that scenario. I am still trying to understand the different ways in which the rowSum and columnSum entropies are calculated in the LLR function. I think they are also equivalent to each other; I just need to work that out. Nevertheless, the LLR value resulting from both methods is the same, which is why I was trying to find out how they are equivalent.

Thank you for taking the time to respond to my question.

Gangadhar

On Fri, Sep 10, 2010 at 2:14 AM, Ted Dunning <[email protected]> wrote:
> What difference are you seeing?
>
> The entropy calculation in LogLikelihood is what I would call un-normalized
> entropy:
>
>     H(K) = -\sum_i k_i log(k_i / n)
>
> This makes the expression for the log-likelihood ratio slightly simpler.
> The log(k_i / n) is also split into two parts to avoid doing lots of
> divisions. This gives:
>
>     H(K) = -\sum_i k_i (log(k_i) - log(n))
>          = -\sum_i k_i log(k_i) + \sum_i k_i log(n)
>          = n log(n) - \sum_i k_i log(k_i)
>
> On Thu, Sep 9, 2010 at 10:06 PM, Gangadhar Nittala
> <[email protected]> wrote:
>
>> All,
>> I am a first-time user of Mahout. I checked out the code and was able
>> to get the build going. I was checking the Tasks list (I use Eclipse)
>> and saw one in LogLikelihoodTest.java to check the epsilons.
>>
>> While checking the code in LogLikelihood.java
>> (org.apache.mahout.math.stats.LogLikelihood), I saw that the code
>> for the Shannon entropy calculation seemed different from the one
>> defined on Wikipedia
>> [http://en.wikipedia.org/wiki/Entropy_(information_theory)].
>>
>> I wrote a small Python script (attached: llr.py) to compare the one
>> present in Mahout (org.apache.mahout.math.stats.LogLikelihood) and
>> the one defined by Ted Dunning in
>> http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html.
>> Even though the entropy and LLR calculations are different, the
>> final output of the LLR is the same with both methods.
>>
>> I am trying to find out why the two methods are equivalent. Can you
>> please let me know why this is the case, or point me to a reference
>> I can check. If this is not the list for this question, I am sorry;
>> I shall try the mahout-users list.
>>
>> Thank you
>> Gangadhar
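
---

[Editor's note: the attached llr.py is not included in the archive. The following is a minimal Python sketch of the comparison the thread discusses; function names are my own. It evaluates the LLR for a 2x2 contingency table two ways: with the un-normalized entropy from Ted's reply (H = N*log(N) - sum_i k_i*log(k_i), combined without an extra N) and with the standard Shannon entropy from Wikipedia (combined with a factor of N, as on the blog), and checks that the results match.]

```python
import math

def xlogx(x):
    # Convention: 0 * log(0) = 0
    return x * math.log(x) if x > 0 else 0.0

def shannon_entropy(counts):
    # Wikipedia definition: H(K) = -sum_i (k_i/N) * log(k_i/N)
    n = sum(counts)
    return -sum((k / n) * math.log(k / n) for k in counts if k > 0)

def unnormalized_entropy(counts):
    # Un-normalized form from Ted's derivation:
    # H(K) = N*log(N) - sum_i k_i*log(k_i)
    return xlogx(sum(counts)) - sum(xlogx(k) for k in counts)

def llr_unnormalized(k11, k12, k21, k22):
    # LLR from the un-normalized entropies; no factor of N needed
    row = unnormalized_entropy([k11 + k12, k21 + k22])
    col = unnormalized_entropy([k11 + k21, k12 + k22])
    mat = unnormalized_entropy([k11, k12, k21, k22])
    return 2.0 * (row + col - mat)

def llr_normalized(k11, k12, k21, k22):
    # LLR from Shannon entropies; the factor of N reappears here,
    # which is the "extra N" discussed above
    n = k11 + k12 + k21 + k22
    row = shannon_entropy([k11 + k12, k21 + k22])
    col = shannon_entropy([k11 + k21, k12 + k22])
    mat = shannon_entropy([k11, k12, k21, k22])
    return 2.0 * n * (row + col - mat)
```

Expanding both expressions gives the same quantity, 2*(N*log(N) + sum_ij k_ij*log(k_ij) - sum_i r_i*log(r_i) - sum_j c_j*log(c_j)), which is why the two scripts agree: the N that divides each count inside the Shannon entropy is exactly cancelled by the N multiplying the combination, leaving the un-normalized form.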
