This is an interesting topic.
The code I quoted is just what's in the code base now. It had made sense
to me -- maybe it's working differently, but in a valid way?
I reverse-engineered the logic, I think, from the code. In looking at
item-item similarity, we're comparing the likelihood of two hypotheses. The
null hypothesis is that the size of the overlap in users who like both items
is just what we'd expect from chance. The alternative hypothesis is that,
well, it isn't, because the items are similar and therefore overlap
unusually much.
The distribution of the size of overlap is just binomial. This comes in
here...
return k * safeLog(p) + (n - k) * safeLog(1.0 - p);
This is the log of the binomial pmf -- missing the binomial-coefficient
constant, but that term cancels in the math that calls it anyway.
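To convince myself the dropped constant really is harmless, here's a quick sketch of my own (made-up numbers, not from the code base):

```java
// A quick sketch of my own (made-up numbers) showing why dropping the
// binomial coefficient is harmless: log C(n, k) depends only on k and n,
// so it is identical in the null and alternative log-likelihoods and
// cancels when they are subtracted.
public class LogLCheck {

  static double logL(double p, double k, double n) {
    return k * Math.log(p) + (n - k) * Math.log(1.0 - p);
  }

  // log of the binomial coefficient C(n, k); fine for small n
  static double logChoose(int n, int k) {
    double r = 0;
    for (int i = 0; i < k; i++) {
      r += Math.log(n - i) - Math.log(i + 1);
    }
    return r;
  }

  public static void main(String[] args) {
    int k = 7;
    int n = 50;
    double pNull = 0.1;
    double pObserved = (double) k / n;
    // Full log-pmf is logChoose(n, k) + logL(p, k, n); the logChoose term
    // is the same under both hypotheses, so it vanishes in the difference.
    double withConstant = (logChoose(n, k) + logL(pObserved, k, n))
                        - (logChoose(n, k) + logL(pNull, k, n));
    double withoutConstant = logL(pObserved, k, n) - logL(pNull, k, n);
    if (Math.abs(withConstant - withoutConstant) > 1e-9) {
      throw new AssertionError();
    }
    System.out.println("log-likelihood difference: " + withoutConstant);
  }
}
```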
And then this method ...
2.0 * (logL(k1 / n1, k1, n1)
     + logL(k2 / n2, k2, n2)
     - logL(p, k1, n1)
     - logL(p, k2, n2))
is really
-2.0 * (logL(p, k1, n1) + logL(p, k2, n2))
  + 2.0 * (logL(k1 / n1, k1, n1) + logL(k2 / n2, k2, n2))
The first pair of terms is (twice, negated) the log-likelihood under the
null hypothesis. The second pair applies the same logic but assumes each
overlap follows its own observed proportion instead.
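To check I haven't fumbled the algebra, here's a self-contained sketch of my own (arbitrary counts) comparing the code's formulation with the regrouped one:

```java
// A quick self-contained check (my own sketch, arbitrary counts) that the
// regrouping above is algebraically the same as the code's formulation.
public class Regroup {

  static double logL(double p, double k, double n) {
    return k * Math.log(p) + (n - k) * Math.log(1.0 - p);
  }

  // The formulation as it appears in the code base.
  static double twoLogLambda(double k1, double k2, double n1, double n2) {
    double p = (k1 + k2) / (n1 + n2);
    return 2.0 * (logL(k1 / n1, k1, n1)
                + logL(k2 / n2, k2, n2)
                - logL(p, k1, n1)
                - logL(p, k2, n2));
  }

  // The same thing regrouped into null-hypothesis and observed-proportion parts.
  static double regrouped(double k1, double k2, double n1, double n2) {
    double p = (k1 + k2) / (n1 + n2);
    double nullLL = logL(p, k1, n1) + logL(p, k2, n2);
    double observedLL = logL(k1 / n1, k1, n1) + logL(k2 / n2, k2, n2);
    return -2.0 * nullLL + 2.0 * observedLL;
  }

  public static void main(String[] args) {
    double a = twoLogLambda(10, 30, 100, 400);
    double b = regrouped(10, 30, 100, 400);
    if (Math.abs(a - b) > 1e-9) {
      throw new AssertionError(a + " != " + b);
    }
    System.out.println("both forms give " + a);
  }
}
```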
And then the input to twoLogLambda makes sense too, I think... the "p"
from the null hypothesis ends up being the fraction of all users who
prefer item 1. The null hypothesis is that the items aren't similar, so
their overlaps should follow that same ratio.
It looks at the ratio of users preferring both items to users preferring 2
(whether or not they prefer 1). It also looks at users who prefer 1 but
not 2, compared to all users who don't prefer 2. Under the null hypothesis,
those ratios ought to be the same -- p.
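Plugging in some made-up counts makes the "p is the fraction of all users preferring item 1" claim concrete (my own sketch, hypothetical numbers):

```java
// A tiny numeric check (made-up counts, my own sketch) that the pooled "p"
// inside twoLogLambda is exactly the fraction of all users preferring item 1.
public class PooledP {

  static double pooledP(int numUsers, int preferring1, int preferring2,
                        int preferring1and2) {
    double k1 = preferring1and2;                // prefer both
    double k2 = preferring1 - preferring1and2;  // prefer 1 but not 2
    double n1 = preferring2;                    // prefer 2
    double n2 = numUsers - preferring2;         // don't prefer 2
    // k1 + k2 collapses to preferring1, and n1 + n2 to numUsers.
    return (k1 + k2) / (n1 + n2);
  }

  public static void main(String[] args) {
    double p = pooledP(1000, 100, 200, 50);
    if (p != 100.0 / 1000.0) {
      throw new AssertionError();
    }
    System.out.println("p = " + p); // prints p = 0.1
  }
}
```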
But then it also computes what the likelihood would be if those ratios
followed distributions with different p's -- the p's that are simply the
ratios of the actual observed counts.
Is my understanding of the application of log-likelihood here, well,
correct, and is it reasonable?
Could you paraphrase your formulation a bit more, so I can understand how
it's different?
PS: why would it be desirable to map log(0) to 0? The limit is negative
infinity.
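PPS: to double-check my reading, I sketched a comparison (my own code, made-up counts) of your kLogP formulation below against the logL-style twoLogLambda, with the 2 x 2 cells built from the marginals the way you describe:

```java
// My own sketch (made-up counts) checking that the kLogP formulation of
// twoLogLambda agrees with the logL-style one, once the 2 x 2 cells are
// derived from the marginal counts.
public class TwoFormsCheck {

  static double logL(double p, double k, double n) {
    return k * Math.log(p) + (n - k) * Math.log(1.0 - p);
  }

  // logL-style form: compare k1/n1 with k2/n2 against the pooled p.
  static double twoLogLambdaLogL(double k1, double k2, double n1, double n2) {
    double p = (k1 + k2) / (n1 + n2);
    return 2.0 * (logL(k1 / n1, k1, n1) + logL(k2 / n2, k2, n2)
                - logL(p, k1, n1) - logL(p, k2, n2));
  }

  static double kLogP(int... values) {
    double total = 0;
    for (int x : values) {
      total += x;
    }
    double result = 0;
    for (int x : values) {
      if (x > 0) {
        result += x * Math.log(x / total);
      }
    }
    return result;
  }

  // kLogP form: cells minus row sums minus column sums.
  static double twoLogLambdaCells(int k11, int k12, int k21, int k22) {
    return 2.0 * (kLogP(k11, k12, k21, k22)
                - kLogP(k11 + k12, k21 + k22)
                - kLogP(k11 + k21, k12 + k22));
  }

  public static void main(String[] args) {
    int numUsers = 1000;
    int preferring1 = 100;
    int preferring2 = 200;
    int preferring1and2 = 50;
    int k11 = preferring1and2;                                        // 1and2
    int k12 = preferring1 - preferring1and2;                          // 1not2
    int k21 = preferring2 - preferring1and2;                          // not1and2
    int k22 = numUsers - preferring1 - preferring2 + preferring1and2; // not1not2
    double a = twoLogLambdaCells(k11, k12, k21, k22);
    // logL form on the rows: (k11 of preferring1) vs (k21 of not1).
    double b = twoLogLambdaLogL(k11, k21, k11 + k12, k21 + k22);
    if (Math.abs(a - b) > 1e-9) {
      throw new AssertionError(a + " != " + b);
    }
    System.out.println("both formulations give " + a);
  }
}
```

So the two agree when the four arguments are interpreted as the table cells, which is I think the point of your rewrite.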
On Sat, Jan 29, 2011 at 7:31 PM, Ted Dunning <[email protected]> wrote:
> the contents of the 2 x 2 matrix are more easily understood if we augment
> it
> with row and column sums:
>
> 1and2        1not2     | preferring1
> not1and2     not1not2  | not1
> -----------------------+------------
> preferring2  not2      | all
>
> So out of this, 1not2 = preferring1 - 1and2, not1and2 = preferring2 - 1and2,
> not2 = all - preferring2, and not1not2 = not2 - 1not2
> = all - preferring1 - preferring2 + 1and2.
>
> Thus I think your code should be:
>
> double logLikelihood = twoLogLambda(preferring1and2,
> preferring1 - preferring1and2,
> preferring2 - preferring1and2,
> numUsers - preferring1 - preferring2
> + preferring1and2);
>
> I find it easier to understand the twoLogLambda code if it is written this
> way:
>
> static double twoLogLambda(int k11, int k12, int k21, int k22) {
>   return 2.0 * (kLogP(k11, k12, k21, k22)
>               - kLogP(k11 + k12, k21 + k22)
>               - kLogP(k11 + k21, k12 + k22));
> }
>
> static double kLogP(int... values) {
>   double total = 0;
>   for (int x : values) {
>     total += x;
>   }
>   double result = 0;
>   for (int x : values) {
>     if (x > 0) {
>       result += x * Math.log(x / total);
>     }
>   }
>   return result;
> }
>
> We have code in Mahout that does something like this. It is also
> essentially the same as the R code I gave earlier.
>
> On Sat, Jan 29, 2011 at 11:09 AM, Sean Owen <[email protected]> wrote:
>
> > Maybe the formulation in the code now is slightly wrong. The key math is:
> >
> > double logLikelihood = twoLogLambda(preferring1and2,
> > preferring1 - preferring1and2,
> > preferring2,
> > numUsers - preferring2);
> >
> >
> > static double twoLogLambda(double k1, double k2, double n1, double n2) {
> > double p = (k1 + k2) / (n1 + n2);
> > return 2.0 * (logL(k1 / n1, k1, n1)
> > + logL(k2 / n2, k2, n2)
> > - logL(p, k1, n1)
> > - logL(p, k2, n2));
> > }
> >
> > private static double logL(double p, double k, double n) {
> > return k * Math.log(p) + (n - k) * Math.log(1.0 - p);
> > }
> >
>