Maybe the formulation in the code now is slightly wrong. The key math is:
double logLikelihood = twoLogLambda(preferring1and2,
                                    preferring1 - preferring1and2,
                                    preferring2,
                                    numUsers - preferring2);

static double twoLogLambda(double k1, double k2, double n1, double n2) {
  double p = (k1 + k2) / (n1 + n2);
  return 2.0 * (logL(k1 / n1, k1, n1)
                + logL(k2 / n2, k2, n2)
                - logL(p, k1, n1)
                - logL(p, k2, n2));
}

private static double logL(double p, double k, double n) {
  return k * safeLog(p) + (n - k) * safeLog(1.0 - p);
}

// Treats log(x) as 0 for x <= 0, so that terms like 0 * log(0) come out 0
private static double safeLog(double d) {
  return d <= 0.0 ? 0.0 : Math.log(d);
}
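
Spelled out (my reading of the code above; logL(p, k, n) is the
log-likelihood of seeing k successes in n Bernoulli(p) trials, and lambda
compares the two separate-rate fits against the pooled single-rate fit):

  \log L(p, k, n) = k \log p + (n - k) \log (1 - p)

  p = (k_1 + k_2) / (n_1 + n_2)

  2 \log \lambda = 2 [ \log L(k_1/n_1, k_1, n_1) + \log L(k_2/n_2, k_2, n_2)
                     - \log L(p, k_1, n_1) - \log L(p, k_2, n_2) ]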
Those parameters are relatively self-explanatory, I think, in the context
of item-item similarity. But do they match your formulation? I don't
recall what the 2x2 matrix means.
In my example, the difference is between calling

  twoLogLambda(4, 1, 4, 1)

and

  twoLogLambda(4, 0, 5, 0)

In both cases, 4 people prefer both items, 5 people like item 1, and 4
people like item 2. There are 5 people total. One result is zero, the
other is NaN.

The second is NaN because k2 = n2 = 0 -- because nobody prefers item 2 and
not item 1, and because there is nobody who doesn't prefer item 1.
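
For concreteness, a minimal harness (assuming the methods above live in
one class) reproduces this:

public static void main(String[] args) {
  System.out.println(twoLogLambda(4, 1, 4, 1)); // similarity(1,2) -> 0.0
  System.out.println(twoLogLambda(4, 0, 5, 0)); // similarity(2,1) -> NaN,
                                                // since k2/n2 = 0/0
}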
On Sat, Jan 29, 2011 at 5:59 PM, Ted Dunning <[email protected]> wrote:
> I don't think that this is correct. The LLR computation should be
> symmetric. The purpose of safeLog is to allow lim_{x \to 0} x \log x = 0.
> In your example I think that we have the following interaction matrix:
>
>         +   -
>    A    5   0
>    B    4   1
>
> Using R, it is easy to show that the llr is the same transposed or not:
>
> > A = matrix(c(5,0,4,1),nrow=2)
> > llr(A)
> [1] 1.497635
> > llr(t(A))
> [1] 1.497635
> >
>
> The same applies if we reverse the order of the rows. In fact, I don't see
> how you can make this example asymmetrical.
>
> On the other hand, if we have, say, a million people, and 1000 interact
> with A, 10 interact with B, and 5 interact with both A and B, then we have
> a somewhat different matrix:
>
>          B        -B
>    A     5       995
>   -A     5   998,995
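>
> (As a sketch of how those cells fall out of the counts -- these variable
> names are invented for illustration, not from any real code:)
>
> long numUsers = 1_000_000, withA = 1_000, withB = 10, withBoth = 5;
> long aAndB   = withBoth;                            // A  and  B:       5
> long aNotB   = withA - withBoth;                    // A  and -B:     995
> long bNotA   = withB - withBoth;                    // -A and  B:       5
> long neither = numUsers - withA - withB + withBoth; // -A and -B: 998,995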
>
> But again, we get A related to B at the same level as B to A. In R:
>
> > A = matrix(c(5, 5, 995, 998995), nrow=2)
> > A
>      [,1]   [,2]
> [1,]    5    995
> [2,]    5 998995
> > t(A)
>      [,1]   [,2]
> [1,]    5      5
> [2,]  995 998995
> > llr(A)
> [1] 55.24958
> > llr(t(A))
> [1] 55.24958
>
> Now asymmetry can creep in if we are limiting the number of non-zero
> elements in rows of the sparsified related items table. Thus this score of
> 55.25 might be in the top 50 scores from A to B, but not in the top 50
> scores from B to A.
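>
> (A sketch of that sparsification step, with invented names -- not Mahout
> code -- to show where the per-row cutoff breaks symmetry:)
>
> // Keep only the n largest scores in a row of the related-items table.
> // score(A,B) can make A's top n while score(B,A) misses B's top n,
> // because each row is truncated against different competitors.
> static double[] topN(double[] rowScores, int n) {
>   double[] sorted = rowScores.clone();
>   java.util.Arrays.sort(sorted); // ascending
>   double cutoff = sorted[Math.max(0, sorted.length - n)];
>   double[] kept = rowScores.clone();
>   for (int i = 0; i < kept.length; i++) {
>     if (kept[i] < cutoff) kept[i] = 0.0; // drop below this row's cutoff
>   }
>   return kept; // ties at the cutoff may keep a few extra entries
> }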
>
> But none of this explains why setting log(0) = 0 causes any problems or NaN
> results. That means I am likely misunderstanding you completely.
>
> Here, btw, is the R definition of llr. The use of (k==0) in llr.H is the
> equivalent of safeLog.
>
> > llr
> function(k) {
>   r = 2 * sum(k) * (llr.H(k) - llr.H(rowSums(k)) - llr.H(colSums(k)))
>   if (r < 0 && r > -1e-12) {
>     r = 0
>   }
>   r
> }
>
> > llr.H
> function(k) {
>   N = sum(k)
>   sum(k/N * log(k/N + (k==0)))
> }
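>
> (For the Java-minded, a rough transliteration of the above -- a sketch
> only; llr and h here are names made up for this mail, not Mahout API:)
>
> // Sketch of llr: 2 * N * (H(cells) - H(rowSums) - H(colSums))
> static double llr(double[][] k) {
>   int rows = k.length, cols = k[0].length;
>   double[] rowSums = new double[rows];
>   double[] colSums = new double[cols];
>   double[] cells = new double[rows * cols];
>   double total = 0.0;
>   for (int i = 0; i < rows; i++) {
>     for (int j = 0; j < cols; j++) {
>       rowSums[i] += k[i][j];
>       colSums[j] += k[i][j];
>       cells[i * cols + j] = k[i][j];
>       total += k[i][j];
>     }
>   }
>   double r = 2.0 * total * (h(cells) - h(rowSums) - h(colSums));
>   // Clamp tiny negative values that are just floating-point noise
>   return (r < 0.0 && r > -1e-12) ? 0.0 : r;
> }
>
> // Equivalent of llr.H; the (x == 0 ? 1 : 0) term plays the role of
> // safeLog: it makes p * log(p) come out 0 when p == 0
> static double h(double[] k) {
>   double n = 0.0;
>   for (double x : k) n += x;
>   double sum = 0.0;
>   for (double x : k) {
>     double p = x / n;
>     sum += p * Math.log(p + (x == 0.0 ? 1.0 : 0.0));
>   }
>   return sum;
> }
>
> // llr(new double[][] {{5, 0}, {4, 1}}) gives 1.497635 and
> // llr(new double[][] {{5, 995}, {5, 998995}}) gives 55.24958,
> // matching the R sessions above.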
>
> On Sat, Jan 29, 2011 at 6:43 AM, Sean Owen <[email protected]> wrote:
> >
> > In LogLikelihoodSimilarity, you'll find a function safeLog() which
> > returns 0.0, rather than NaN, when the log of a non-positive number is
> > computed.
> >
> > It creates an asymmetry in corner cases. For example, imagine we have 5
> > users. All 5 are associated to item A; all but one are associated to
> > item B. The similarity between A and B is 0.0, but the similarity
> > between B and A is NaN.
> >
> > Taking off this safety feature makes both NaN. I think it's neither
> > more nor less theoretically defensible either way. But in practice,
> > it's slightly bad, as it means no similarity is available in some edge
> > cases.
> >
> > My intuition says we should be wary of edge cases -- you can get funny
> > results. So part of me thinks it's good to turn these into NaN and
> > ignore them. They are, after all, 0.0 similar at the moment.
> >
> > Is my intuition roughly right?
>