Exactly right. The co-occurrence matrix is already used as a sort of similarity matrix. And yes in any event it really ought to be normalized.
I'd be pleased for you to try making these changes. I don't think there's really anything to do for a). We just need to separate steps up to creating the co-occurrence matrix from the rest, which is simple since they're already different MR jobs. Perhaps you see a nice way to structure this code. b) will take a little work, yes, but is still straightforward. It can be in the "common" part of the process, done after the similarity matrix is generated. On Mon, Jun 21, 2010 at 9:08 AM, Sebastian Schelter <[email protected]> wrote: > I have had some time this weekend to take a deeper look at Sean's slides > from Berlin buzzwords, where he explains the math behind the distributed > item-based recommender. I think I found a way to extend it from using > only simple cooccurrence counts to using the standard computations of an > item-based recommender as defined in Sarwar et al "Item-Based > Collaborative Filtering Recommendation Algorithms" > (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.9927&rep=rep1&type=pdf). > > > I'd be happy to see someone check and validate my thoughts! > > If I understand the distributed recommender correctly, what it generally > does is that it computes the prediction values for all users towards all > items those users have not rated yet. And the computation is done in the > following way: > > u = a user > i = an item not yet rated by u > N = all items cooccurring with i > > Prediction(u,i) = sum(all n from N: cooccurrences(i,n) * rating(u,n)) > > The formula used in the paper which is used by > GenericItemBasedRecommender.doEstimatePreference(...) too, looks very > similar to the one above: > > u = a user > i = an item not yet rated by u > N = all items similar to i (where similarity is usually computed by > pairwisely comparing the item-vectors of the user-item matrix) > > Prediction(u,i) = sum(all n from N: similarity(i,n) * rating(u,n)) / > sum(all n from N: abs(similarity(i,n))) > > There are only 2 differences: > a) instead of the cooccurrence count, certain similarity measures like > pearson or cosine can be used > b) the resulting value is normalized by the sum of the similarities > > to overcome difference a) we would only need to replace the part that > computes the cooccurrence matrix with the code from ItemSimilarityJob or > the code introduced in MAHOUT-418, then we could compute arbitrary > similarity matrices and use them in the same way the cooccurrence matrix > is currently used > > Regarding difference b) from a first look at the implementation I think > it should be possible to transfer the necessary similarity matrix > entries from the PartialMultiplyMapper to the > AggregateAndRecommendReducer to be able to compute the normalization > value in the denominator of the formula. > > -sebastian >
