The Hadoop return value maps LLR into the 0..1 range, so all questions are answered. None of this was a bug, just a different way of returning values.
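In other words, the legacy return value squashes LLR with 1 - 1/(1 + llr), which is strictly increasing and maps [0, inf) onto [0, 1), so rankings are preserved. A minimal sketch of that mapping (the function name is mine):

// the squashing applied in the legacy return statement quoted below
def squash(llr: Double): Double = 1.0 - 1.0 / (1.0 + llr)
squash(0.0)                // 0.0 for independent items
squash(1.7260924347106847) // 0.6331745808516107, the value discussed below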
On Jul 3, 2014, at 8:39 AM, Pat Ferrel <[email protected]> wrote:
This is solved, with one question at the end. Basically there are at least 3 ways to calculate LLR, and Hadoop itemsimilarity is the odd one out.
Looking at Ted’s github example, the counts seem to be taken from the cooccurrences with the diagonal removed, so no self-cooccurrence.
val AtAdNoSelfCooc = dense(
  (0, 1, 0, 1, 0),
  (1, 0, 0, 0, 0),
  (0, 0, 0, 1, 0),
  (1, 0, 1, 0, 0),
  (0, 0, 0, 0, 0))
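One way to arrive at that matrix, assuming the same Matrix API used elsewhere in this thread and the input A from the original message below (the name noSelfCooc is mine):

// equivalent derivation: compute AtA, then zero the diagonal
val noSelfCooc = A.transpose().times(A)
for (i <- 0 until noSelfCooc.rowSize())
  noSelfCooc.set(i, i, 0.0) // drop the self-cooccurrence counts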
// again using entry (1,0) as the example
// Ted’s code
for (MatrixSlice row : cooccurrence) {
  for (Vector.Element element : row.vector().nonZeroes()) {
    long k11 = (long) element.get();                        // = 1
    long k12 = (long) (rowSums.get(row.index()) - k11);     // = 0
    long k21 = (long) (colSums.get(element.index()) - k11); // = 1
    long k22 = (long) (total - k11 - k12 - k21);            // = 2
    // k =
    // 1, 0
    // 1, 2
    double score = LogLikelihood.rootLogLikelihoodRatio(k11, k12, k21, k22);
    element.set(score);
  }
}
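One thing worth spelling out (my reading of the comment values, not something stated in the thread): for k22 to come out as 2, total must be 4, i.e. the number of users (rows of A), not the sum of the cooccurrence counts (which is 6 here). Recapping the counts for entry (1,0):

// counts for entry (1,0) of AtAdNoSelfCooc (names are mine)
val k11 = 1L                   // the cooccurrence count itself
val k12 = 1L - k11             // rowSums(1) = 1, so k12 = 0
val k21 = 2L - k11             // colSums(0) = 2, so k21 = 1
val k22 = 4L - k11 - k12 - k21 // total = 4 users, so k22 = 2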
So the k matrix looks correct if the above assumptions are correct. But the
Hadoop impl returns a slightly massaged value for LLR:
// mrlegacy code for itemsimilarity
double logLikelihood =
    LogLikelihood.logLikelihoodRatio(preferring1and2,
                                     preferring2 - preferring1and2,
                                     preferring1 - preferring1and2,
                                     numUsers - preferring1 - preferring2 + preferring1and2);
return 1.0 - 1.0 / (1.0 + logLikelihood);
Notice there is no root LLR (same ranking, so that seems fine). I’m also not sure why the 1.0 - 1.0 / (1.0 + logLikelihood), but plugging that into the R calc yields 0.6331746, the same value as Hadoop itemsimilarity.
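A quick check of those numbers against Mahout’s LogLikelihood, with the k matrix above:

import org.apache.mahout.math.stats.LogLikelihood

// k for entry (1,0): k11 = 1, k12 = 0, k21 = 1, k22 = 2
val llr = LogLikelihood.logLikelihoodRatio(1, 0, 1, 2)      // 1.7260924347106847
val root = LogLikelihood.rootLogLikelihoodRatio(1, 0, 1, 2) // 1.3138083706198118
val legacy = 1.0 - 1.0 / (1.0 + llr)                        // 0.6331745808516107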
So the mystery is solved. Now the question is why the "return 1.0 - 1.0 / (1.0 + logLikelihood);".
I will assume that, at least for comparison with legacy code, we want to do this, but I’d like to know why.
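For completeness, the legacy "preferring" counts for the same (1,0) entry, tallied from A (a sketch; the comments are mine):

val numUsers = 4L
val preferring1 = 1L     // users of item 1: {0}
val preferring2 = 2L     // users of item 0: {0, 3}
val preferring1and2 = 1L // users of both: {0}
// the legacy call above becomes logLikelihoodRatio(1, 1, 0, 2): k12 and k21 are
// swapped relative to Ted’s (1, 0, 1, 2). LLR is invariant under transposing the
// contingency table, so both orderings give the same 1.7260924347106847.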
Begin forwarded message:
From: Pat Ferrel <[email protected]>
Subject: LLR
Date: July 2, 2014 at 11:56:44 AM PDT
To: Ted Dunning <[email protected]>
Cc: Sebastian Schelter <[email protected]>
Might as well add myself to the list of people asking for an LLR explanation. Hadoop itemsimilarity is returning different values than the Spark version on the small matrix below. I’m having a hard time sorting this out, so please bear with me.
Let’s take the A’A case for simplicity. It looks like we want to calculate the LLR for each non-zero entry in the AtA matrix using counts we got from A. For example, let’s take the case of item 1 = itemA and item 0 = itemB, so entry (1,0).
// input matrix: rows = users, columns = items
val A = dense(
  (1, 1, 0, 0, 0),
  (0, 0, 1, 1, 0),
  (0, 0, 0, 0, 1),
  (1, 0, 0, 1, 0))
val AtA = A.transpose().times(A)
// AtA == AtAd:
val AtAd = dense(
  (2, 1, 0, 1, 0),
  (1, 1, 0, 0, 0),
  (0, 0, 1, 1, 0),
  (1, 0, 1, 2, 0),
  (0, 0, 0, 0, 1))
It looks like Spark cooccurrence calculates, for itemA = 1, itemB = 0:
// k =
// 1, 0
// 1, 2
Using Hadoop itemsimilarity I get 0.6331745808516107; using the above k and rootLogLikelihoodRatio I get 1.3138083706198118; using logLikelihoodRatio it comes out (not surprisingly) to 1.7260924347106847, which agrees with the R version from Ted’s blog. So either k is wrong or I’ve missed some other difference in the Hadoop vs. Spark versions. I assume root or not doesn’t matter since the ranking is the same.
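(As far as I can tell, rootLogLikelihoodRatio is just the signed square root of logLikelihoodRatio, so the ordering is indeed identical:)

math.sqrt(1.7260924347106847) // = 1.3138083706198118, the root value above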
It would help if you could tell me what k11 … k22 are for entry (1,0) of AtA and how you calculated them.