Maybe the formulation in the code now is slightly wrong. The key math is:
double logLikelihood = twoLogLambda(preferring1and2,
                                    preferring1 - preferring1and2,
                                    preferring2,
                                    numUsers - preferring2);

static double twoLogLambda(double k1, double k2, double n1, double n2) {
  double p = (k1 + k2) / (n1 + n2);
  return 2.0 * (logL(k1 / n1, k1, n1)
                + logL(k2 / n2, k2, n2)
                - logL(p, k1, n1)
                - logL(p, k2, n2));
}

private static double logL(double p, double k, double n) {
  return k * safeLog(p) + (n - k) * safeLog(1.0 - p);
}

// Treats log(x) as 0 for x <= 0, so that terms like 0 * log(0) come out 0
private static double safeLog(double d) {
  return d <= 0.0 ? 0.0 : Math.log(d);
}
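
Spelled out (my reading of the code above; logL(p, k, n) is the
log-likelihood of seeing k successes in n Bernoulli(p) trials, and lambda
compares the two separate-rate fits against the pooled single-rate fit):

  \log L(p, k, n) = k \log p + (n - k) \log (1 - p)

  p = (k_1 + k_2) / (n_1 + n_2)

  2 \log \lambda = 2 [ \log L(k_1/n_1, k_1, n_1) + \log L(k_2/n_2, k_2, n_2)
                     - \log L(p, k_1, n_1) - \log L(p, k_2, n_2) ]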
Those parameters are relatively self-explanatory, I think, in the context
of item-item similarity. But do they match your formulation? I don't
recall what the 2x2 matrix means.
In my example, the difference is between calling

  twoLogLambda(4, 1, 4, 1)

and

  twoLogLambda(4, 0, 5, 0)

In both cases, 4 people prefer both items, 5 people like item 1, and 4
people like item 2. There are 5 people total. One result is zero, the
other is NaN.

The second is NaN because k2 = n2 = 0 -- because nobody prefers item 2 and
not item 1, and because there is nobody who doesn't prefer item 1.
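
For concreteness, a minimal harness (assuming the methods above live in
one class) reproduces this:

public static void main(String[] args) {
  System.out.println(twoLogLambda(4, 1, 4, 1)); // similarity(1,2) -> 0.0
  System.out.println(twoLogLambda(4, 0, 5, 0)); // similarity(2,1) -> NaN,
                                                // since k2/n2 = 0/0
}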
On Sat, Jan 29, 2011 at 5:59 PM, Ted Dunning <[email protected]> wrote:
> I don't think that this is correct. The LLR computation should be
> symmetric. The purpose of safeLog is to allow lim_{x \to 0} x \log x = 0.
> In your example I think that we have the following interaction matrix:
>
>         +   -
>    A    5   0
>    B    4   1
>
> Using R, it is easy to show that the llr is the same transposed or not:
>
> > A = matrix(c(5,0,4,1),nrow=2)
> > llr(A)
> [1] 1.497635
> > llr(t(A))
> [1] 1.497635
> >
>
> The same applies if we reverse the order of the rows. In fact, I don't see
> how you can make this example asymmetrical.
>
> On the other hand, if we have, say, a million people, and 1000 interact
> with A, 10 interact with B, and 5 interact with both A and B, then we have
> a somewhat different matrix:
>
>          B        -B
>    A     5       995
>   -A     5   998,995
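>
> (As a sketch of how those cells fall out of the counts -- these variable
> names are invented for illustration, not from any real code:)
>
> long numUsers = 1_000_000, withA = 1_000, withB = 10, withBoth = 5;
> long aAndB   = withBoth;                            // A  and  B:       5
> long aNotB   = withA - withBoth;                    // A  and -B:     995
> long bNotA   = withB - withBoth;                    // -A and  B:       5
> long neither = numUsers - withA - withB + withBoth; // -A and -B: 998,995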
>
> But again, we get A related to B at the same level as B to A. In R:
>
> > A = matrix(c(5, 5, 995, 998995), nrow=2)
> > A
>      [,1]   [,2]
> [1,]    5    995
> [2,]    5 998995
> > t(A)
>      [,1]   [,2]
> [1,]    5      5
> [2,]  995 998995
> > llr(A)
> [1] 55.24958
> > llr(t(A))
> [1] 55.24958
>
> Now asymmetry can creep in if we are limiting the number of non-zero
> elements in rows of the sparsified related items table. Thus this score of
> 55.25 might be in the top 50 scores from A to B, but not in the top 50
> scores from B to A.
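>
> (A sketch of that sparsification step, with invented names -- not Mahout
> code -- to show where the per-row cutoff breaks symmetry:)
>
> // Keep only the n largest scores in a row of the related-items table.
> // score(A,B) can make A's top n while score(B,A) misses B's top n,
> // because each row is truncated against different competitors.
> static double[] topN(double[] rowScores, int n) {
>   double[] sorted = rowScores.clone();
>   java.util.Arrays.sort(sorted); // ascending
>   double cutoff = sorted[Math.max(0, sorted.length - n)];
>   double[] kept = rowScores.clone();
>   for (int i = 0; i < kept.length; i++) {
>     if (kept[i] < cutoff) kept[i] = 0.0; // drop below this row's cutoff
>   }
>   return kept; // ties at the cutoff may keep a few extra entries
> }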
>
> But none of this explains why setting log(0) = 0 causes any problems or NaN
> results. That means I am likely misunderstanding you completely.
>
> Here, btw, is the R definition of llr. The use of (k==0) in llr.H is the
> equivalent of safeLog.
>
> > llr
> function(k) {
>   r = 2 * sum(k) * (llr.H(k) - llr.H(rowSums(k)) - llr.H(colSums(k)))
>   if (r < 0 && r > -1e-12) {
>     r = 0
>   }
>   r
> }
>
> > llr.H
> function(k) {
>   N = sum(k)
>   sum(k/N * log(k/N + (k==0)))
> }
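>
> (For the Java-minded, a rough transliteration of the above -- a sketch
> only; llr and h here are names made up for this mail, not Mahout API:)
>
> // Sketch of llr: 2 * N * (H(cells) - H(rowSums) - H(colSums))
> static double llr(double[][] k) {
>   int rows = k.length, cols = k[0].length;
>   double[] rowSums = new double[rows];
>   double[] colSums = new double[cols];
>   double[] cells = new double[rows * cols];
>   double total = 0.0;
>   for (int i = 0; i < rows; i++) {
>     for (int j = 0; j < cols; j++) {
>       rowSums[i] += k[i][j];
>       colSums[j] += k[i][j];
>       cells[i * cols + j] = k[i][j];
>       total += k[i][j];
>     }
>   }
>   double r = 2.0 * total * (h(cells) - h(rowSums) - h(colSums));
>   // Clamp tiny negative values that are just floating-point noise
>   return (r < 0.0 && r > -1e-12) ? 0.0 : r;
> }
>
> // Equivalent of llr.H; the (x == 0 ? 1 : 0) term plays the role of
> // safeLog: it makes p * log(p) come out 0 when p == 0
> static double h(double[] k) {
>   double n = 0.0;
>   for (double x : k) n += x;
>   double sum = 0.0;
>   for (double x : k) {
>     double p = x / n;
>     sum += p * Math.log(p + (x == 0.0 ? 1.0 : 0.0));
>   }
>   return sum;
> }
>
> // llr(new double[][] {{5, 0}, {4, 1}}) gives 1.497635 and
> // llr(new double[][] {{5, 995}, {5, 998995}}) gives 55.24958,
> // matching the R sessions above.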
>
> On Sat, Jan 29, 2011 at 6:43 AM, Sean Owen <[email protected]> wrote:
> >
> > In LogLikelihoodSimilarity, you'll find a function safeLog() which
> > returns 0.0, rather than NaN, when the log of a non-positive number is
> > computed.
> >
> > It creates an asymmetry in corner cases. For example, imagine we have 5
> > users. All 5 are associated to item A; all but one are associated to
> > item B. The similarity between A and B is 0.0, but the similarity
> > between B and A is NaN.
> >
> > Taking off this safety feature makes both NaN. I think it's neither
> > more nor less theoretically defensible either way. But in practice,
> > it's slightly bad, as it means no similarity is available in some edge
> > cases.
> >
> > My intuition says we should be wary of edge cases -- you can get funny
> > results. So part of me thinks it's good to turn these into NaN and
> > ignore them. They are, after all, 0.0 similar at the moment.
> >
> > Is my intuition roughly right?
>