This is a horrifying possibility.  I thought we had several test cases in
place to verify this code.

Let me look.  I wonder if the code you have found is not referenced somehow.


On Sun, Jun 2, 2013 at 11:23 PM, 陈文龙 <qzche...@gmail.com> wrote:

> The definition of
> org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(long k11,
> long k12, long k21, long k22):
>
>     public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
>       Preconditions.checkArgument(k11 >= 0 && k12 >= 0 && k21 >= 0 && k22 >= 0);
>       // note that we have counts here, not probabilities, and that the entropy is not normalized.
>       double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
>       double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
>       double matrixEntropy = entropy(k11, k12, k21, k22);
>       if (rowEntropy + columnEntropy > matrixEntropy) {
>         // round off error
>         return 0.0;
>       }
>       return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
>     }
>
> The rowEntropy and columnEntropy computed here might be wrong. I think they
> should be:
>
>       double rowEntropy = entropy(k11 + k12, k21 + k22);
>       double columnEntropy = entropy(k11 + k21, k12 + k22);
>
> which matches the formula LLR = 2 sum(k) (H(k) - H(rowSums(k)) -
> H(colSums(k))) given at
> http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html .
>
> LLR = G2 = 2 * N * I, where N is the sample size (k11 + k12 + k21 + k22 in
> this example) and I is the mutual information.
>
> [image: inline image 1] where x is the value of event A (1 or 2) and y is
> the value of event B (1 or 2); p(x,y) = kxy/N and p(x) = p(x,1) + p(x,2),
> e.g. p(1,1) = k11/N.
>
>
> [image: inline image 2] From this we get mutual_information = H(k) -
> H(rowSums(k)) - H(colSums(k)).
>
> Mahout's unnormalized entropy(k11,k12,k21,k22) equals N * H(k), so we
> get:
>
>     entropy(k11,k12,k21,k22) - entropy(k11+k12, k21+k22) -
>     entropy(k11+k21, k12+k22) = N * (H(k) - H(rowSums(k)) - H(colSums(k)))
>
> which, multiplied by 2.0, is just the LLR.
>
> Is the org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio wrong
> or have I misunderstood something?
>
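
One way to compare the two formulations quoted above is to check them
numerically against G2 written directly from its definition,
2 * sum k_ij * ln(k_ij * N / (row_i * col_j)). The following is only a
minimal sketch under stated assumptions, not Mahout's code: the class
LlrSketch and all of its methods are hypothetical names, and entropy() here
is assumed to use the sign convention of the blog post's H, that is
sum k_i * ln(k_i / N) with no leading minus sign. Whether Mahout's own
entropy() uses that convention has to be checked in the source. The
round-off guard is left out so the raw values are visible.

```java
final class LlrSketch {
  private LlrSketch() {}

  // x * ln(x), with the usual convention 0 * ln(0) = 0.
  static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // ASSUMPTION: unnormalized entropy with the blog post's sign convention,
  // sum_i k_i * ln(k_i / N) = sum_i xLogX(k_i) - xLogX(N).
  static double entropy(long... counts) {
    long n = 0;
    double sum = 0.0;
    for (long k : counts) {
      if (k < 0) {
        throw new IllegalArgumentException("counts must be non-negative");
      }
      sum += xLogX(k);
      n += k;
    }
    return sum - xLogX(n);
  }

  // G2 directly from the definition; empty cells contribute nothing.
  static double llrDirect(long k11, long k12, long k21, long k22) {
    long[][] k = {{k11, k12}, {k21, k22}};
    long[] row = {k11 + k12, k21 + k22};
    long[] col = {k11 + k21, k12 + k22};
    double n = k11 + k12 + k21 + k22;
    double sum = 0.0;
    for (int i = 0; i < 2; i++) {
      for (int j = 0; j < 2; j++) {
        if (k[i][j] > 0) {
          sum += k[i][j] * Math.log(k[i][j] * n / ((double) row[i] * col[j]));
        }
      }
    }
    return 2.0 * sum;
  }

  // Entropy form built from the marginal sums, as proposed in the quoted message.
  static double llrMarginalSums(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
  }

  // Entropy form built from per-row and per-column entropies, as in the quoted method body.
  static double llrPerRowAndColumn(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
    double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
  }

  public static void main(String[] args) {
    long k11 = 10, k12 = 20, k21 = 30, k22 = 400; // hypothetical counts
    System.out.println("direct:         " + llrDirect(k11, k12, k21, k22));
    System.out.println("marginal sums:  " + llrMarginalSums(k11, k12, k21, k22));
    System.out.println("per row/column: " + llrPerRowAndColumn(k11, k12, k21, k22));
  }
}
```

Under the assumed sign convention, the marginal-sum form agrees with the
direct computation, and the per-row/per-column form comes out as its
negative. If entropy() instead returns the conventional non-negative value
(the sketch's value negated), the roles swap: the per-row/per-column form
then matches the direct value and the marginal-sum form is its negative. So
which version is right depends on how entropy() is actually defined.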
