The Hadoop return value maps LLR into the 0..1 range, so all questions are answered. None of this was a bug, just a different way of returning values.
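In other words, the legacy return value squashes LLR with 1 - 1/(1 + llr), which is strictly increasing and maps [0, inf) onto [0, 1), so rankings are preserved. A minimal sketch of that mapping (the function name is mine):

// the squashing applied in the legacy return statement quoted below
def squash(llr: Double): Double = 1.0 - 1.0 / (1.0 + llr)
squash(0.0)                // 0.0 for independent items
squash(1.7260924347106847) // 0.6331745808516107, the value discussed below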
On Jul 3, 2014, at 8:39 AM, Pat Ferrel <[email protected]> wrote:
This is solved, with one question at the end. Basically there are at least 3 ways to calculate LLR, and Hadoop itemsimilarity is the odd one out.
Looking at Ted’s github example, the counts seem to be taken from the cooccurrences with the diagonal removed, so no self-cooccurrence.
val AtAdNoSelfCooc = dense(
  (0, 1, 0, 1, 0),
  (1, 0, 0, 0, 0),
  (0, 0, 0, 1, 0),
  (1, 0, 1, 0, 0),
  (0, 0, 0, 0, 0))
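One way to arrive at that matrix, assuming the same Matrix API used elsewhere in this thread and the input A from the original message below (the name noSelfCooc is mine):

// equivalent derivation: compute AtA, then zero the diagonal
val noSelfCooc = A.transpose().times(A)
for (i <- 0 until noSelfCooc.rowSize())
  noSelfCooc.set(i, i, 0.0) // drop the self-cooccurrence counts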
// again using entry (1,0) as the example
// Ted’s code
for (MatrixSlice row : cooccurrence) {
  for (Vector.Element element : row.vector().nonZeroes()) {
    long k11 = (long) element.get();                        // = 1
    long k12 = (long) (rowSums.get(row.index()) - k11);     // = 0
    long k21 = (long) (colSums.get(element.index()) - k11); // = 1
    long k22 = (long) (total - k11 - k12 - k21);            // = 2
    // k =
    // 1, 0
    // 1, 2
    double score = LogLikelihood.rootLogLikelihoodRatio(k11, k12, k21, k22);
    element.set(score);
  }
}
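One thing worth spelling out (my reading of the comment values, not something stated in the thread): for k22 to come out as 2, total must be 4, i.e. the number of users (rows of A), not the sum of the cooccurrence counts (which is 6 here). Recapping the counts for entry (1,0):

// counts for entry (1,0) of AtAdNoSelfCooc (names are mine)
val k11 = 1L                   // the cooccurrence count itself
val k12 = 1L - k11             // rowSums(1) = 1, so k12 = 0
val k21 = 2L - k11             // colSums(0) = 2, so k21 = 1
val k22 = 4L - k11 - k12 - k21 // total = 4 users, so k22 = 2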
So the k matrix looks correct if the above assumptions are correct. But the
Hadoop impl returns a slightly massaged value for LLR:
// mrlegacy code for itemsimilarity
double logLikelihood =
    LogLikelihood.logLikelihoodRatio(preferring1and2,
                                     preferring2 - preferring1and2,
                                     preferring1 - preferring1and2,
                                     numUsers - preferring1 - preferring2 + preferring1and2);
return 1.0 - 1.0 / (1.0 + logLikelihood);
Notice there is no root LLR (same ranking, so that seems fine). I’m also not sure why the 1.0 - 1.0 / (1.0 + logLikelihood), but plugging that into the R calc yields 0.6331746, the same value as Hadoop itemsimilarity.
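A quick check of those numbers against Mahout’s LogLikelihood, with the k matrix above:

import org.apache.mahout.math.stats.LogLikelihood

// k for entry (1,0): k11 = 1, k12 = 0, k21 = 1, k22 = 2
val llr = LogLikelihood.logLikelihoodRatio(1, 0, 1, 2)      // 1.7260924347106847
val root = LogLikelihood.rootLogLikelihoodRatio(1, 0, 1, 2) // 1.3138083706198118
val legacy = 1.0 - 1.0 / (1.0 + llr)                        // 0.6331745808516107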
So the mystery is solved. Now the question is why the "return 1.0 - 1.0 / (1.0 + logLikelihood);".
I will assume that, at least for comparison with legacy code, we want to do this, but I’d like to know why.
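For completeness, the legacy "preferring" counts for the same (1,0) entry, tallied from A (a sketch; the comments are mine):

val numUsers = 4L
val preferring1 = 1L     // users of item 1: {0}
val preferring2 = 2L     // users of item 0: {0, 3}
val preferring1and2 = 1L // users of both: {0}
// the legacy call above becomes logLikelihoodRatio(1, 1, 0, 2): k12 and k21 are
// swapped relative to Ted’s (1, 0, 1, 2). LLR is invariant under transposing the
// contingency table, so both orderings give the same 1.7260924347106847.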
Begin forwarded message:
From: Pat Ferrel <[email protected]>
Subject: LLR
Date: July 2, 2014 at 11:56:44 AM PDT
To: Ted Dunning <[email protected]>
Cc: Sebastian Schelter <[email protected]>
Might as well add myself to the list of people asking for an LLR explanation. Hadoop itemsimilarity is returning different values than the Spark version on the small matrix below. I’m having a hard time sorting this out, so please bear with me.
Let’s take the A’A case for simplicity. It looks like we want to calculate the LLR for each non-zero entry in the AtA matrix using counts we got from A. For example, let’s take the case of item 1 = itemA and item 0 = itemB, so entry (1,0).
// input matrix: rows = users, columns = items
val A = dense(
  (1, 1, 0, 0, 0),
  (0, 0, 1, 1, 0),
  (0, 0, 0, 0, 1),
  (1, 0, 0, 1, 0))
val AtA = A.transpose().times(A)
// AtA == AtAd:
val AtAd = dense(
  (2, 1, 0, 1, 0),
  (1, 1, 0, 0, 0),
  (0, 0, 1, 1, 0),
  (1, 0, 1, 2, 0),
  (0, 0, 0, 0, 1))
It looks like Spark cooccurrence calculates, for itemA = 1, itemB = 0:
// k =
// 1, 0
// 1, 2
Using Hadoop itemsimilarity I get 0.6331745808516107; using the above k and rootLogLikelihoodRatio I get 1.3138083706198118; using logLikelihoodRatio it comes out (not surprisingly) to 1.7260924347106847, which agrees with the R version from Ted’s blog. So either k is wrong or I’ve missed some other difference in the Hadoop vs. Spark versions. I assume root or not doesn’t matter since the ranking is the same.
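(As far as I can tell, rootLogLikelihoodRatio is just the signed square root of logLikelihoodRatio, so the ordering is indeed identical:)

math.sqrt(1.7260924347106847) // = 1.3138083706198118, the root value above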
It would help if you could tell me what k11 … k22 are for entry (1,0) of AtA and how you calculated them.