The definition of
org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(long k11,
long k12, long k21, long k22):

    public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
      Preconditions.checkArgument(k11 >= 0 && k12 >= 0 && k21 >= 0 && k22 >= 0);
      // note that we have counts here, not probabilities, and that the entropy is not normalized.
      double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
      double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
      double matrixEntropy = entropy(k11, k12, k21, k22);
      if (rowEntropy + columnEntropy > matrixEntropy) {
        // round off error
        return 0.0;
      }
      return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
    }
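
For reference, the entropy used above is Mahout's "unnormalized" entropy over
raw counts. If I'm reading LogLikelihood.java correctly, it is roughly
equivalent to the following (paraphrased from memory, so the exact signatures
may differ):

    private static double xLogX(long x) {
      return x == 0 ? 0.0 : x * Math.log(x);
    }

    // xLogX(N) - sum_i xLogX(k_i) = -sum_i k_i * log(k_i / N),
    // i.e. N times the Shannon entropy of the count distribution
    private static double entropy(long... elements) {
      long sum = 0;
      double result = 0.0;
      for (long element : elements) {
        result += xLogX(element);
        sum += element;
      }
      return xLogX(sum) - result;
    }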

The rowEntropy and columnEntropy computed here might be wrong; I think it
should be:

      double rowEntropy = entropy(k11 + k12, k21 + k22);
      double columnEntropy = entropy(k11 + k21, k12 + k22);

which is the same as LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k))),
as described in
http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html .
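
For comparison, here is a direct transcription of that formula into Java (my
own sketch, not Mahout code; h below is the blog's H, i.e. sum(p * log p)
without the usual minus sign):

    // blog-convention "entropy" of a count vector; zero counts contribute 0
    static double h(long... k) {
      long n = 0;
      for (long x : k) {
        n += x;
      }
      double sum = 0.0;
      for (long x : k) {
        if (x > 0) {
          double p = (double) x / n;
          sum += p * Math.log(p);
        }
      }
      return sum;
    }

    // LLR = 2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))
    static double llr(long k11, long k12, long k21, long k22) {
      long n = k11 + k12 + k21 + k22;
      return 2.0 * n * (h(k11, k12, k21, k22)
          - h(k11 + k12, k21 + k22)
          - h(k11 + k21, k12 + k22));
    }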

LLR = G^2 = 2 * N * I, where N is the sample size (k11 + k12 + k21 + k22 in
this example) and I is the mutual information.

The mutual information here is

    I(X;Y) = sum_x sum_y p(x,y) * log( p(x,y) / (p(x) * p(y)) )

where x is the eventA value (1 or 2), y is the eventB value (1 or 2),
p(x,y) = k_xy / N, and p(x) = p(x,1) + p(x,2); e.g. p(1,1) = k11 / N.


Expanding that double sum in entropy terms, here we get

    mutual_information = H(k) - H(rowSums(k)) - H(colSums(k))
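
This is the standard identity, written with the blog's sign convention for H
(there H(p) = sum p * log p, without the usual minus sign, so no sign flip is
needed):

    \begin{aligned}
    I(X;Y) &= \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} \\
           &= \sum_{x,y} p(x,y)\,\log p(x,y)
            - \sum_{x} p(x)\,\log p(x)
            - \sum_{y} p(y)\,\log p(y) \\
           &= H(k) - H(\mathrm{rowSums}(k)) - H(\mathrm{colSums}(k))
    \end{aligned}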

The Mahout version of the unnormalized entropy gives
entropy(k11, k12, k21, k22) = N * H(k), so we get:

    entropy(k11,k12,k21,k22) - entropy(k11+k12, k21+k22) - entropy(k11+k21, k12+k22)
        = N * (H(k) - H(rowSums(k)) - H(colSums(k)))

which, multiplied by 2.0, is just the LLR.
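
In case anyone wants to check numerically, here is a minimal self-contained
sketch that evaluates both variants side by side (the class name and counts
are made up for illustration):

    public class LlrCheck {
      static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
      }

      // same unnormalized entropy as sketched above
      static double entropy(long... elements) {
        long sum = 0;
        double result = 0.0;
        for (long x : elements) {
          result += xLogX(x);
          sum += x;
        }
        return xLogX(sum) - result;
      }

      public static void main(String[] args) {
        long k11 = 100, k12 = 1000, k21 = 1000, k22 = 100000;
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // rowEntropy/columnEntropy as currently written in logLikelihoodRatio
        double rowCurrent = entropy(k11, k12) + entropy(k21, k22);
        double colCurrent = entropy(k11, k21) + entropy(k12, k22);
        // the summed variant suggested above
        double rowSummed = entropy(k11 + k12, k21 + k22);
        double colSummed = entropy(k11 + k21, k12 + k22);
        System.out.println("current:   " + 2.0 * (matrixEntropy - rowCurrent - colCurrent));
        System.out.println("suggested: " + 2.0 * (matrixEntropy - rowSummed - colSummed));
      }
    }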

Is org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio wrong, or
have I misunderstood something?
