This is close.

The formulation here is definitely based on a classical frequentist
analysis.

But the analysis is of two multinomial distributions (binomial in this
case).  Both distributions describe people liking or not liking item A.  One
distribution is in the case of people who like B and the other is in the
case of people who don't like B.

The null hypothesis is that the distribution of A preferences is the same
for both B-likers and B-dislikers, which is to say that the two
distributions share the value of their defining parameter.  The alternative
is that the two distributions each have their own parameter.

Put another way, if we parameterize the A preference distributions with a
parameter p_A1 for the B-liking case and p_A2 for the B-disliking case, the
overall hypothesis space is the unit square defined by the cartesian product
of all values for p_A1 and p_A2.

The null hypothesis is the subset of that square where p_A1 = p_A2.  The
alternative is the entire square.  Note that the alternative is NOT the
square minus the null hypothesis.

This represents a nice situation mathematically where the null hypothesis is
a subset of the alternative.
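To make that nesting concrete, here is a little sketch (the counts and the names k1, n1, k2, n2 are invented for illustration, not taken from the Mahout code): the alternative is maximized by the per-group proportions k1/n1 and k2/n2, the null by the pooled proportion, and because the null set sits inside the alternative set, the alternative's maximized log likelihood can never come out smaller.

```java
// Sketch only: the null (p_A1 == p_A2) is a subset of the alternative
// (free p_A1 and p_A2), so the alternative's maximized log likelihood can
// never be smaller than the null's.  Counts here are invented for illustration.
class NestedHypotheses {
    // log of the binomial likelihood kernel; the n-choose-k constant is
    // dropped because it cancels in the likelihood ratio anyway
    static double logL(double p, double k, double n) {
        return k * safeLog(p) + (n - k) * safeLog(1.0 - p);
    }

    // treats 0 * log(0) terms as 0
    static double safeLog(double x) {
        return x <= 0.0 ? 0.0 : Math.log(x);
    }

    public static void main(String[] args) {
        double k1 = 13, n1 = 1000;      // B-likers who also like A, out of all B-likers
        double k2 = 1000, n2 = 100000;  // B-dislikers who like A, out of all B-dislikers

        double p = (k1 + k2) / (n1 + n2);  // best single parameter on the null diagonal
        double alt = logL(k1 / n1, k1, n1) + logL(k2 / n2, k2, n2);
        double nul = logL(p, k1, n1) + logL(p, k2, n2);

        System.out.println("alternative: " + alt + "  null: " + nul);
        if (alt < nul) throw new AssertionError("a subset cannot beat its superset");
    }
}
```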

Chernoff showed in the 50's that twice the log of the likelihood ratio for
such a subset and superset is asymptotically chi-squared distributed if the
data are sampled from the null hypothesis.  That is what makes
log-likelihood ratio tests very cool.
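You can see Chernoff's result without any math by simulation.  In this sketch the sample sizes, the shared p, and the seed are arbitrary choices of mine: draw many pairs of binomial samples that genuinely share one parameter, compute -2 log lambda for each pair, and the mean lands near 1, the mean of a chi-squared variable with one degree of freedom.

```java
import java.util.Random;

// Sketch: when both groups really share one parameter (the null is true),
// the -2 log lambda statistic is approximately chi-squared with one degree
// of freedom, whose mean is 1.  Sample sizes, p, and seed are arbitrary.
class NullSimulation {
    static double safeLog(double x) { return x <= 0.0 ? 0.0 : Math.log(x); }

    static double logL(double p, double k, double n) {
        return k * safeLog(p) + (n - k) * safeLog(1.0 - p);
    }

    static double twoLogLambda(double k1, double n1, double k2, double n2) {
        double p = (k1 + k2) / (n1 + n2);
        return 2.0 * (logL(k1 / n1, k1, n1) + logL(k2 / n2, k2, n2)
                - logL(p, k1, n1) - logL(p, k2, n2));
    }

    // naive binomial sampler: count successes in n Bernoulli(p) draws
    static double binomial(Random rng, int n, double p) {
        int k = 0;
        for (int i = 0; i < n; i++) {
            if (rng.nextDouble() < p) {
                k++;
            }
        }
        return k;
    }

    static double meanStatistic(long seed, int trials) {
        Random rng = new Random(seed);
        int n1 = 500;
        int n2 = 2000;
        double p = 0.3;  // a single shared parameter, so the null holds
        double sum = 0.0;
        for (int t = 0; t < trials; t++) {
            sum += twoLogLambda(binomial(rng, n1, p), n1, binomial(rng, n2, p), n2);
        }
        return sum / trials;
    }

    public static void main(String[] args) {
        // should print a value close to 1.0
        System.out.println("mean of -2 log lambda: " + meanStatistic(42L, 5000));
    }
}
```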

Back to our case.

The twoLogLambda expression is intended to be just such a log likelihood
ratio for the case of the binomial distribution.  The normal Pearson's
chi-squared test is an approximation of the log likelihood ratio using the
normal distribution, so it blows up in the cases we want to work with.

It is tedious to work out the math, but not all that hard.  I am happy to
send the full derivation from my dissertation to anybody who wants it enough
to ask by email.

The result boils down to a few different forms.  One form is the one I used
in my ancient paper (here: http://acl.ldc.upenn.edu/J/J93/J93-1003.pdf ).
That is similar to the form used in the code you have.  Another form is the
one I pushed on my blog (here:
http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html ).

I much prefer the second form because I can remember it precisely.  It is
the form used in org.apache.mahout.math.stats.LogLikelihood.
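Roughly, that second form boils down to entropies of the 2x2 contingency table.  The following is a sketch from memory, not a verbatim copy of the Mahout class, and it cross-checks against the binomial twoLogLambda form using the mapping k11 = k1, k12 = n1 - k1, k21 = k2, k22 = n2 - k2 (the table counts in main are made up for the demonstration):

```java
// Sketch from memory of the entropy form; the actual Mahout class may differ
// in details.  For a 2x2 contingency table this agrees with the binomial
// twoLogLambda form via k11 = k1, k12 = n1 - k1, k21 = k2, k22 = n2 - k2.
class EntropyFormSketch {
    static double xLogX(double x) {
        return x <= 0.0 ? 0.0 : x * Math.log(x);
    }

    // unnormalized entropy of a partition of (sum of counts) items
    static double entropy(double... counts) {
        double sum = 0.0;
        double xlx = 0.0;
        for (double c : counts) {
            sum += c;
            xlx += xLogX(c);
        }
        return xLogX(sum) - xlx;
    }

    static double logLikelihoodRatio(double k11, double k12, double k21, double k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }

    // the binomial form, included here only for the cross-check
    static double safeLog(double x) { return x <= 0.0 ? 0.0 : Math.log(x); }
    static double logL(double p, double k, double n) {
        return k * safeLog(p) + (n - k) * safeLog(1.0 - p);
    }
    static double twoLogLambda(double k1, double n1, double k2, double n2) {
        double p = (k1 + k2) / (n1 + n2);
        return 2.0 * (logL(k1 / n1, k1, n1) + logL(k2 / n2, k2, n2)
                - logL(p, k1, n1) - logL(p, k2, n2));
    }

    public static void main(String[] args) {
        double a = logLikelihoodRatio(110, 2442, 111, 29114);
        double b = twoLogLambda(110, 110 + 2442, 111, 111 + 29114);
        if (Math.abs(a - b) > 1e-6) throw new AssertionError(a + " != " + b);
        System.out.println("both forms give: " + a);
    }
}
```

As a sanity check, a perfectly independent table, where each cell equals its row total times its column total over the grand total, scores essentially zero.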

Does that help?

On Sun, Jan 30, 2011 at 12:12 PM, Sean Owen <[email protected]> wrote:

> I reverse-engineered the logic, I think, from the code. In looking at
> item-item similarity, we're comparing the likelihood of two hypotheses. The
> null hypothesis is that the size of the overlap in users that like the item
> is just what we'd expect from chance. The alternate hypothesis is that,
> well, it isn't, because the items are similar and therefore overlap
> unusually highly.
>
> The distribution of the size of overlap is just binomial. This comes in
> here...
>
>    return k * safeLog(p) + (n - k) * safeLog(1.0 - p);
>
> This is the log of the binomial pdf -- missing the constant factor part but
> this vanishes in the math that calls it anyway.
>
> And then this method ...
>
>                 2.0 * (logL(k1 / n1, k1, n1)
>                  + logL(k2 / n2, k2, n2)
>                  - logL(p, k1, n1)
>                  - logL(p, k2, n2))
>
> is really
>
> -2.0 * ((logL(p, k1, n1) + logL(p, k2, n2)) - (logL(k1 / n1, k1, n1) +
> logL(k2 / n2, k2, n2)))
>
> The first two terms are the log of the likelihood of the null hypothesis.
> And the second two try on the same logic but assuming the overlap
> distributes differently.
>
>
> And then the input to twoLogLambda makes sense, I think...  the "p"
> from the null hypothesis ends up being the percentage of all users that
> prefer item 1. The null hypothesis is that the items aren't similar, so
> their overlap should follow a similar ratio.
>
