If the question is, why do we output *cooccurrence* instead of something else, like the elements of a vector product as in the paper, then the answer is that it's to support different similarity metrics. In both cases, the algorithm builds up item-item similarity by extracting a billion tiny bits of it -- whether a cooccurrence or a term of a dot product -- and piecing them together later.
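For concreteness, here's a stripped-down sketch of that pattern -- not the actual Mahout code, and the input format (one user's item IDs per line, space-separated) and class names are just assumptions for illustration; the job driver wiring is omitted:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CooccurrenceSketch {

  private static final IntWritable ONE = new IntWritable(1);

  public static class CooccurrenceMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // One user's row: the item IDs that user interacted with.
      String[] items = value.toString().split(" ");
      // Emit one tiny "bit" of similarity per cooccurring item pair.
      // This is quadratic in the row length, which is exactly the cost
      // being discussed in this thread.
      for (int i = 0; i < items.length; i++) {
        for (int j = 0; j < items.length; j++) {
          if (i != j) {
            context.write(new Text(items[i] + ',' + items[j]), ONE);
          }
        }
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text pair, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      // Here the bits are just summed to a cooccurrence count; preserving
      // richer per-pair records instead is what buys metric flexibility.
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(pair, new IntWritable(sum));
    }
  }
}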
(I do think the generality comes at a price: it outputs not just a count but a full record of every cooccurrence, which is big. The upside is that you can implement most any common metric downstream with that much info preserved.)

What did you have in mind by map-side? You can't do this all map-side, and the paper suggests it emits all the "bits" from the mapper and then adds them up in the reducer, which is the same as the RowSimilarityJob procedure. Unless I miss something significant, it's the same.

If the question is, why do we output *all* cooccurrences, then it's because in theory you need to count everything to compute all item-item similarities. None of those are somehow irrelevant or redundant. However, each individual cooccurrence may make very little difference in the final result. Pruning is just fine here, and RowSimilarityJob does *not* have levers to pull to throw away data at this stage in order to emit fewer cooccurrences. It could. That's what the paper does, pretty coarsely, by just throwing away large posting lists; it would not be quite so valid for recommenders -- ignoring the word "the" in doc similarity is not the same as ignoring a very common album. But there are simpler rules, like what Ted has said a few times: just cap the size of any column (yes, here we mean column, not row) by randomly tossing elements past a certain size.

The combiner question is interesting. Many metrics happen to be built up from some commutative, associative function of smaller parts, like addition. I think the answer is 'yes' in many cases -- but not in all cases. For example, I don't think you can construe log-likelihood this way. I'm also not sure how Pearson would come out, since you need the mean from the start, though I wouldn't be surprised if there's a clever way to compute it online.

I think the big win for your use case is adding that lever to trim down big columns, which is quite easy to add.
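To make that capping lever concrete, here's roughly what I have in mind -- nothing that exists in RowSimilarityJob today, and the class name is made up. It keeps a uniform random sample of one item's column (the users who interacted with it) via reservoir sampling:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative only: bounds the size of one item's "column" by reservoir
// sampling. Every element past the cap randomly displaces an earlier one,
// so the result is a uniform random subset of size maxSize. Capping here
// bounds the quadratic blow-up when pairs are emitted for very popular items.
final class CappedColumn<T> {

  private final int maxSize;
  private final Random random;
  private final List<T> sample;
  private int seen;

  CappedColumn(int maxSize, long seed) {
    this.maxSize = maxSize;
    this.random = new Random(seed);
    this.sample = new ArrayList<T>(maxSize);
  }

  void add(T element) {
    seen++;
    if (sample.size() < maxSize) {
      sample.add(element);
    } else {
      // Keep the new element with probability maxSize / seen,
      // replacing a uniformly chosen existing slot.
      int slot = random.nextInt(seen);
      if (slot < maxSize) {
        sample.set(slot, element);
      }
    }
  }

  List<T> contents() {
    return sample;
  }
}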
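And on the Pearson point, the usual trick -- if we wanted it -- is to carry sufficient statistics instead of needing the means up front. The partial sums merge freely in a combiner, and the correlation falls out at the end. Again, just a sketch, not anything in the current code:

// Carries (n, sum x, sum y, sum x^2, sum y^2, sum xy) for one item pair.
// Merging is component-wise addition, which is commutative and associative,
// so a combiner can pre-aggregate mapper output safely.
final class PearsonAccumulator {

  private long n;
  private double sumX;
  private double sumY;
  private double sumXX;
  private double sumYY;
  private double sumXY;

  // Map side: fold in one co-rated (x, y) value pair.
  void observe(double x, double y) {
    n++;
    sumX += x;
    sumY += y;
    sumXX += x * x;
    sumYY += y * y;
    sumXY += x * y;
  }

  // Combiner/reducer side: merge two partial accumulators.
  void merge(PearsonAccumulator other) {
    n += other.n;
    sumX += other.sumX;
    sumY += other.sumY;
    sumXX += other.sumXX;
    sumYY += other.sumYY;
    sumXY += other.sumXY;
  }

  // Final reducer: compute the correlation from the merged statistics.
  double correlation() {
    if (n == 0) {
      return 0.0;
    }
    double cov = sumXY - sumX * sumY / n;
    double varX = sumXX - sumX * sumX / n;
    double varY = sumYY - sumY * sumY / n;
    double denom = Math.sqrt(varX * varY);
    return denom == 0.0 ? 0.0 : cov / denom;
  }
}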
On Mon, Jul 18, 2011 at 7:42 PM, Grant Ingersoll <[email protected]> wrote:

> Been looking at this some more...
>
> I don't think the paper (or the other one I cited later which has more
> details) output all this co-occurrence temporary stuff that we do. We seem
> to be doing context.write() inside of an inner loop and simply outputting
> that a relationship exists. I think the paper calculates as much as it can
> of the score on the map side and then emits the partial score and then the
> reducer does the final sum. If I'm reading our stuff right, I think we
> output all the co-occurrences and then do the full sum on the reduce.
> Perhaps we emit all the co-occurrences b/c that allows us to plug in
> different similarity measures? See page 3 of the Jimmy Lin paper I emailed
> the other day.
>
> Also, could we benefit from a Combiner in the similarity calculation phase?
> That's all algebraic there, right?
>
> On Jul 14, 2011, at 6:26 PM, Sean Owen wrote:
>
>> What's a row here, a user? I completely agree but then this describes how
>> you start item-item similarity computation, where items are columns right?
>> The job here is turned on its side, computing row similarity.
>>
>> On Jul 14, 2011 11:21 PM, "Ted Dunning" <[email protected]> wrote:
>>
>>> The problem arises when the program is reading a single row and emitting
>>> all of the cooccurring items. The number of items emitted is the square
>>> of the number of items in a row. Thus, it is more dense rows that cause
>>> the problem.
>>>
>>> On Thu, Jul 14, 2011 at 2:25 PM, Sean Owen <[email protected]> wrote:
>>>
>>>> In their example, docs were rows and words were columns. The terms of
>>>> the inner products they computed came from processing the posting
>>>> lists / columns instead of rows and emitting all pairs of docs
>>>> containing a word. Sounds like they just tossed the posting list for
>>>> common words. Anyway that's why I said cols and think that's right. At
>>>> least, that is what RowSimilarityJob is doing.
>>>>
>>>> On Thu, Jul 14, 2011 at 10:05 PM, Ted Dunning <[email protected]> wrote:
>>>>
>>>>> Rows.
>>>>>
>>>>> On Thu, Jul 14, 2011 at 12:24 PM, Sean Owen <[email protected]> wrote:
>>>>>
>>>>>> Just needs a rule for tossing data -- you could simply throw away such
>>>>>> columns (ouch), or at least use only a sampled subset of it.
>
> --------------------------
> Grant Ingersoll
