I don't think that this is correct. The LLR computation should be
symmetric. The purpose of safeLog is to implement the convention that
0 log 0 = 0, which follows from lim_{x -> 0} x log x = 0.
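As a quick numerical check of that limit (this is just an illustration, not
Mahout code):

> x = 10^-(1:6)
> x * log(x)
[1] -2.302585e-01 -4.605170e-02 -6.907755e-03 -9.210340e-04 -1.151293e-04
[6] -1.381551e-05

The product heads to 0 as x does, which is why treating 0 log 0 as 0 is the
natural convention.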
In your example I think that we have the following interaction matrix:

       +   -
  A    5   0
  B    4   1
Using R, it is easy to show that the llr is the same whether transposed or not:
> A = matrix(c(5,0,4,1),nrow=2)
> llr(A)
[1] 1.497635
> llr(t(A))
[1] 1.497635
>
The same applies if we reverse the order of the rows. In fact, I don't see
how you can make this example asymmetrical.
On the other hand, if we have, say, a million people, where 1000 interact
with A, 10 interact with B, and 5 interact with both A and B, then we have a
somewhat different matrix:

          B       -B
  A       5      995
 -A       5  998,995
But again, we get A related to B at the same level as B to A. In R:
> A = matrix(c(5, 5, 995, 998995), nrow=2)
> A
     [,1]   [,2]
[1,]    5    995
[2,]    5 998995
> t(A)
     [,1]   [,2]
[1,]    5      5
[2,]  995 998995
> llr(A)
[1] 55.24958
> llr(t(A))
[1] 55.24958
Now asymmetry can creep in if we are limiting the number of non-zero
elements in rows of the sparsified related items table. Thus this score of
55.25 might be in the top 50 scores from A to B, but not in the top 50
scores from B to A.
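To make that concrete, here is a toy sketch (my own example, not Mahout
code): start from a symmetric 4x4 score matrix, keep only the top 2 scores
per row, and the result is no longer symmetric.

> S = matrix(c(0, 9, 8, 7,
+              9, 0, 2, 1,
+              8, 2, 0, 3,
+              7, 1, 3, 0), nrow=4, byrow=TRUE)
> topN = function(row, n) {
+     # zero out everything except the n largest entries of the row
+     out = rep(0, length(row))
+     keep = order(row, decreasing=TRUE)[1:n]
+     out[keep] = row[keep]
+     out
+ }
> Sparse = t(apply(S, 1, topN, n=2))
> all(Sparse == t(Sparse))
[1] FALSE

Row 4 keeps its score of 7 against item 1, but item 1's top 2 are items 2
and 3, so the link from 1 to 4 is cut in one direction only.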
But none of this explains why setting log(0) = 0 causes any problems or NaN
results. That means I am likely misunderstanding you completely.
Here, btw, is the R definition of llr. The use of (k==0) in llr.H is the
equivalent of safeLog.
> llr
function(k) {
    # 2N times the mutual information between the rows and columns of k
    r = 2 * sum(k) * (llr.H(k) - llr.H(rowSums(k)) - llr.H(colSums(k)))
    # clamp tiny negative values that come from floating point round-off
    if (r < 0 && r > -1e-12) {
        r = 0
    }
    r
}
> llr.H
function(k) {
    # (k==0) turns each zero cell's log(0) into log(1) = 0, so 0 log 0 = 0
    N = sum(k)
    sum(k/N * log(k/N + (k==0)))
}
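To see what that guard is protecting against: in R, log(0) is -Inf and
0 * -Inf is NaN, so without the (k==0) term a single empty cell poisons the
whole sum. Using the first matrix above:

> k = matrix(c(5,0,4,1), nrow=2)
> N = sum(k)
> sum(k/N * log(k/N))
[1] NaN
> sum(k/N * log(k/N + (k==0)))
[1] -0.9433484

The guarded version is exactly llr.H(k), which is what lets llr(A) come out
as the finite 1.497635 above rather than NaN.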
On Sat, Jan 29, 2011 at 6:43 AM, Sean Owen <[email protected]> wrote:
>
> In LogLikelihoodSimilarity, you'll find a function safeLog() which returns
> 0.0, rather than NaN, when the log of a non-positive number is computed.
>
> It creates an asymmetry in corner cases. For example, imagine we have 5
> users. All 5 are associated to item A; all but one are associated to item
> B. The similarity between 1 and 2 is 0.0, but the similarity between 2 and
> 1 is NaN.
>
> Taking off this safety feature makes both NaN. I think it's neither more
> nor less theoretically defensible to go either way. But in practice, it's
> slightly bad as it means no similarity is available in some edge cases.
>
> My intuition says we should be wary of edge cases -- you can get funny
> results. So part of me thinks it's good to turn these into NaN and ignore
> them. They are after all 0.0 similar at the moment.
>
> Is my intuition roughly right?