This is close. The formulation here is definitely based on a classical frequentist analysis.
But the analysis is of two multinomial distributions (binomial in this case). Both distributions describe people liking or not liking item A. One distribution is for the people who like B and the other is for the people who don't like B. The null hypothesis is that the distribution of A preferences is the same for B-likers and B-dislikers, which is to say that the two distributions share the value of their defining parameter. The alternative is that the two distributions each have their own parameter.

Put another way, if we parameterize the A preference distributions with a parameter p_A1 for the B-liking case and p_A2 for the B-disliking case, the overall hypothesis space is the unit square defined by the Cartesian product of all values of p_A1 and p_A2. The null hypothesis is the subset of that square where p_A1 = p_A2. The alternative is the entire square. Note that the alternative is NOT the square minus the null hypothesis. This is a nice situation mathematically because the null hypothesis is a subset of the alternative. Chernoff showed in the 50's that the distribution of the log-likelihood ratio for such a subset and superset is asymptotically chi-squared as long as the data come from somewhere in the null hypothesis parameter set. That is what makes log-likelihood ratio tests very cool.

Back to our case. The twoLogLambda expression is intended to be just such a log-likelihood ratio for the case of the binomial distribution. The ordinary Pearson's chi-squared test is an approximation of the log-likelihood ratio based on the normal distribution, so it blows up in exactly the small-count cases we want to work with.

It is tedious to work out the math, but not all that hard. I am happy to send the full derivation from my dissertation to anybody who wants it enough to ask by email. The result boils down to a few different forms. One form is the one I used in my ancient paper (here: http://acl.ldc.upenn.edu/J/J93/J93-1003.pdf ). That is similar to the form used in the code you have. Another form is the one I pushed on my blog (here: http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html ). I much prefer the second form because I can remember it precisely. It is the form used in org.apache.mahout.math.stats.LogLikelihood. I have appended small sketches of both forms below Sean's message.

Does that help?

On Sun, Jan 30, 2011 at 12:12 PM, Sean Owen <[email protected]> wrote:

> I reverse-engineered the logic, I think, from the code. In looking at
> item-item similarity, we're comparing the likelihood of two hypotheses. The
> null hypothesis is that the size of the overlap in users that like the item
> is just what we'd expect from chance. The alternate hypothesis is that,
> well, it isn't, because the items are similar and therefore overlap
> unusually highly.
>
> The distribution of the size of overlap is just binomial. This comes in
> here...
>
>     return k * safeLog(p) + (n - k) * safeLog(1.0 - p);
>
> This is the log of the binomial pdf -- missing the constant factor part,
> but this vanishes in the math that calls it anyway.
>
> And then this method ...
>
>     2.0 * (logL(k1 / n1, k1, n1)
>         + logL(k2 / n2, k2, n2)
>         - logL(p, k1, n1)
>         - logL(p, k2, n2))
>
> is really
>
>     -2.0 * ((logL(p, k1, n1) + logL(p, k2, n2))
>         - (logL(k1 / n1, k1, n1) + logL(k2 / n2, k2, n2)))
>
> The first two terms are the log of the likelihood of the null hypothesis.
> And the second two apply the same logic but assume the overlap distributes
> differently.
>
> And then I think the input to twoLogLambda makes sense, I think...
the "p" > from the null hypothesis ends up being the percentage of all users that > prefer item 1. The null hypothesis is that the item's aren't similar, so > their overlap should follow a similar ratio. >

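And here is a sketch of the second form, the one from the blog post, written directly in terms of the 2x2 table of counts. Again, this is my paraphrase for illustration, not a verbatim copy of org.apache.mahout.math.stats.LogLikelihood.

// Sketch of the entropy form: the same -2 log lambda, written in terms of the
// raw counts of the 2x2 table
//   k11 = users who like both A and B     k12 = like A but not B
//   k21 = like B but not A                k22 = like neither
public final class EntropyLlr {

  private EntropyLlr() { }

  private static double xLogX(long x) {
    return x == 0L ? 0.0 : x * Math.log(x);
  }

  // Un-normalized Shannon entropy of a set of counts: the usual entropy in
  // nats multiplied by the total count.
  private static double entropy(long... counts) {
    long total = 0L;
    double sumXLogX = 0.0;
    for (long count : counts) {
      total += count;
      sumXLogX += xLogX(count);
    }
    return xLogX(total) - sumXLogX;
  }

  // -2 log lambda = 2 * (H(row sums) + H(column sums) - H(whole table))
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    // clamp tiny negative values that can appear from floating point round-off
    return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
  }
}

Both sketches compute the same number: taking the B-likers as one binomial sample and the B-dislikers as the other, twoLogLambda(k11, k11 + k21, k12, k12 + k22) agrees with logLikelihoodRatio(k11, k12, k21, k22) up to floating point round-off.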