I don't think that this is correct.  The LLR computation should be
symmetric.  The purpose of safeLog is to honor the convention that
x log x -> 0 as x -> 0, i.e. that 0 * log(0) counts as 0.
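
As an aside, some guard is needed because plain floating point (R included)
evaluates 0 * log(0) as 0 * -Inf, which is NaN, rather than taking that
limit.  A quick illustration of my own, not taken from any of the code below:

x = c(1e-1, 1e-3, 1e-6, 1e-9)
x * log(x)               # tends to 0 from below
0 * log(0)               # NaN: 0 * -Inf
0 * log(0 + (0 == 0))    # 0, the same trick that (k==0) plays in llr.H below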
In your example I think that we have the following interaction matrix


   +  -
A  5  0
B  4  1

Using R, it is easy to show that the llr is the same whether or not the
matrix is transposed:

> A = matrix(c(5,0,4,1),nrow=2)
> llr(A)
[1] 1.497635
> llr(t(A))
[1] 1.497635
>

The same applies if we reverse the order of the rows.  In fact, I don't see
how you can make this example asymmetrical.
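
For instance, a quick check of the reversed cases (using the llr defined at
the bottom of this message) looks like this:

llr(A[2:1, ])    # rows reversed; should print 1.497635 again
llr(A[, 2:1])    # columns reversed; same value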

On the other hand, if we have, say, a million people, 1000 of whom interact
with A, 10 of whom interact with B, and 5 of whom interact with both A and B,
then we have a somewhat different matrix:

    B      -B
 A  5     995
-A  5 998,995
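
(The cells follow directly from those counts: A-and-B = 5, A-only = 1000 - 5
= 995, B-only = 10 - 5 = 5, and neither = 1,000,000 - 5 - 995 - 5 = 998,995.)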

But again, we get A related to B at the same level as B to A.  In R:

> A = matrix(c(5, 5, 995, 998995), nrow=2)
> A
     [,1]   [,2]
[1,]    5    995
[2,]    5 998995
> t(A)
     [,1]   [,2]
[1,]    5      5
[2,]  995 998995
> llr(A)
[1] 55.24958
> llr(t(A))
[1] 55.24958

Now asymmetry can creep in if we are limiting the number of non-zero
elements in rows of the sparsified related items table.  Thus this score of
55.25 might be in the top 50 scores from A to B, but not in the top 50
scores from B to A.
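
Here is a toy sketch of that effect with made-up scores (topN is just a
throwaway helper, not anything in Mahout): start with a symmetric score
matrix, keep only the top 2 entries per row, and the result is no longer
symmetric:

S = matrix(c( 0, 55, 10, 40,
             55,  0, 60, 70,
             10, 60,  0, 80,
             40, 70, 80,  0), nrow=4, byrow=TRUE)
topN = function(s, n) {
  for (i in 1:nrow(s)) {
    cutoff = sort(s[i,], decreasing=TRUE)[n]
    s[i, s[i,] < cutoff] = 0
  }
  s
}
pruned = topN(S, 2)
pruned[1,2]    # 55 survives; it is in row 1's top 2
pruned[2,1]    # 0; from row 2's point of view, 55 is not in its top 2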

But none of this explains why setting log(0) = 0 causes any problems or NaN
results.  That means I am likely misunderstanding you completely.

Here, btw, is the R definition of llr.  The use of (k==0) in llr.H is the
equivalent of safeLog.

> llr
function(k) {
  # the G^2 (log-likelihood ratio) statistic for the table k
  r = 2* sum(k) * (llr.H(k) - llr.H(rowSums(k)) - llr.H(colSums(k)))
  if (r < 0 && r > -1e-12) {
    # clamp tiny negative values that are just floating point rounding error
    r = 0
  }
  r
}

> llr.H
function(k) {
  N = sum(k)
  # (k==0) turns log(0) into log(1) = 0, so empty cells contribute 0 * 0 = 0
  sum(k/N * log(k/N + (k==0)))
}
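
Incidentally, llr.H as written is just negative Shannon entropy in nats, so
the same score can be written as 2 * N * (mutual information between rows
and columns).  As a cross-check on the example above:

mi = function(k) {
  N = sum(k)
  H = function(p) -sum(p * log(p + (p == 0)))
  H(rowSums(k)/N) + H(colSums(k)/N) - H(k/N)
}
A = matrix(c(5, 5, 995, 998995), nrow=2)
2 * sum(A) * mi(A)    # ~55.25, matching llr(A) above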

On Sat, Jan 29, 2011 at 6:43 AM, Sean Owen <[email protected]> wrote:
>
> In LogLikelihoodSimilarity, you'll find a function safeLog() which returns
> 0.0, rather than NaN, when the log of a non-positive number is computed.
>
> It creates an asymmetry in corner cases. For example, imagine we have 5
> users. All 5 are associated to item A; all but one are associated to item B.
> The similarity between 1 and 2 is 0.0, but the similarity between 2 and 1 is
> NaN.
>
> Taking off this safety feature makes both NaN. I think it's neither more or
> less theoretically defensible to go either way. But in practice, it's
> slightly bad as it means no similarity is available in some edge cases.
>
> My intuition says we should be wary of edge cases -- you can get funny
> results. So part of me thinks it's good to turn these into NaN and ignore
> them. They are after all 0.0 similar at the moment.
>
> Is my intuition roughly right?
