This is a horrifying possibility. I thought we had several test cases in place to verify this code.
Let me look. I wonder if the code you have found is not referenced somehow.

On Sun, Jun 2, 2013 at 11:23 PM, 陈文龙 <qzche...@gmail.com> wrote:

> The definition of
> org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(long k11,
> long k12, long k21, long k22):
>
>   public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
>     Preconditions.checkArgument(k11 >= 0 && k12 >= 0 && k21 >= 0 && k22 >= 0);
>     // note that we have counts here, not probabilities, and that the
>     // entropy is not normalized.
>     double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
>     double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
>     double matrixEntropy = entropy(k11, k12, k21, k22);
>     if (rowEntropy + columnEntropy > matrixEntropy) {
>       // round off error
>       return 0.0;
>     }
>     return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
>   }
>
> The rowEntropy and columnEntropy computed here might be wrong. I
> think they should be:
>
>     double rowEntropy = entropy(k11 + k12, k21 + k22);
>     double columnEntropy = entropy(k11 + k21, k12 + k22);
>
> which is the same as
>
>     LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))
>
> as given in
> http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html .
>
> LLR = G2 = 2 * N * I, where N is the sample size (k11 + k12 + k21 + k22 in
> this example) and I is the mutual information.
>
> [image: inline image 1] where x is the value of event A (1 or 2) and y is
> the value of event B (1 or 2). p(x,y) = kxy/N and p(x) = p(x,1) + p(x,2),
> e.g. p(1,1) = k11/N.
>
> [image: inline image 2] Here we get mutual_information = H(k) -
> H(rowSums(k)) - H(colSums(k)).
>
> The Mahout version of the unnormalized entropy is
> entropy(k11,k12,k21,k22) = N * H(k), so we get:
>
>     entropy(k11,k12,k21,k22) - entropy(k11+k12, k21+k22) - entropy(k11+k21, k12+k22)
>       = N * (H(k) - H(rowSums(k)) - H(colSums(k)))
>
> which, multiplied by 2.0, is just the LLR.
>
> Is org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio wrong,
> or have I misunderstood something?
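
One way to settle a question like this without reading the call graph is to compare an entropy-based LLR against 2 * N * I computed straight from the definition of mutual information. The sketch below does that. It is not the Mahout source: the class LlrCheck and the helpers xLogX, entropy, llrFromEntropies and llrFromMutualInformation are hypothetical names for illustration, and entropy is assumed to be the non-negative unnormalized Shannon entropy N * H(k), as the quoted message states. Under that assumption the mutual information comes out as H(rowSums) + H(colSums) - H(k); the formula quoted above has the same value if H is taken with the opposite sign (sum of p log p), so the sign convention of the real helper is worth checking first.

```java
import java.util.Arrays;

final class LlrCheck {
    // x * ln(x), with the usual convention 0 * ln(0) = 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // ASSUMPTION: unnormalized Shannon entropy, N * H(k / N) >= 0,
    // computed as xLogX(N) - sum_i xLogX(k_i). Hypothetical helper.
    static double entropy(long... counts) {
        long sum = 0;
        double result = 0.0;
        for (long k : counts) {
            result += xLogX(k);
            sum += k;
        }
        return xLogX(sum) - result;
    }

    // LLR from entropies of the row sums, the column sums and the full table.
    static double llrFromEntropies(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }

    // Reference value: 2 * N * I, with I taken directly from
    // I = sum_xy p(x,y) * ln( p(x,y) / (p(x) * p(y)) ).
    static double llrFromMutualInformation(long k11, long k12, long k21, long k22) {
        long[][] k = {{k11, k12}, {k21, k22}};
        long[] row = {k11 + k12, k21 + k22};
        long[] col = {k11 + k21, k12 + k22};
        double n = k11 + k12 + k21 + k22;
        double sum = 0.0;
        for (int x = 0; x < 2; x++) {
            for (int y = 0; y < 2; y++) {
                if (k[x][y] > 0) {
                    // N * p(x,y) * ln(p(x,y) / (p(x) p(y))) = k_xy * ln(k_xy * N / (R_x * C_y))
                    sum += k[x][y] * Math.log(k[x][y] * n / ((double) row[x] * col[y]));
                }
            }
        }
        return 2.0 * sum;
    }

    public static void main(String[] args) {
        long[][] tables = {{10, 20, 30, 60}, {100, 5, 7, 1000}, {1, 0, 0, 1}};
        for (long[] t : tables) {
            System.out.printf("%s entropies=%.9f direct=%.9f%n",
                Arrays.toString(t),
                llrFromEntropies(t[0], t[1], t[2], t[3]),
                llrFromMutualInformation(t[0], t[1], t[2], t[3]));
        }
    }
}
```

The two columns should agree up to rounding for any table of non-negative counts, and an independent table such as (10, 20, 30, 60) should give a value near zero. Swapping another candidate formula into llrFromEntropies and rerunning shows quickly whether it is algebraically equivalent under the entropy definition actually in use.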