Exactly right. The co-occurrence matrix is already being used as a sort of
similarity matrix. And yes, in any event it really ought to be
normalized.
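
To make the normalization concrete, here is a minimal sketch (made-up matrix values and class name, not Mahout code) that row-normalizes a co-occurrence matrix by the sum of the absolute entries in each row:

```java
// Hypothetical sketch: row-normalize a small dense co-occurrence matrix so
// that each row's absolute values sum to 1. This is not Mahout's actual
// implementation, just an illustration of the idea.
public class NormalizeSketch {

    static double[][] rowNormalize(double[][] m) {
        double[][] out = new double[m.length][];
        for (int i = 0; i < m.length; i++) {
            double sum = 0.0;
            for (double v : m[i]) {
                sum += Math.abs(v);   // denominator: sum of |entries| in row i
            }
            out[i] = new double[m[i].length];
            for (int j = 0; j < m[i].length; j++) {
                out[i][j] = sum == 0.0 ? 0.0 : m[i][j] / sum;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] cooc = {{0, 2, 1}, {2, 0, 3}, {1, 3, 0}};
        double[][] norm = rowNormalize(cooc);
        // each row of norm now sums to 1, e.g. row 0 becomes {0, 2/3, 1/3}
        System.out.println(java.util.Arrays.deepToString(norm));
    }
}
```

Normalizing the rows up front is roughly equivalent to applying the denominator of the paper's formula at prediction time, in the special case where the user has rated every co-occurring item.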

I'd be pleased for you to try making these changes.

I don't think there's really anything to do for a). We just need to
separate steps up to creating the co-occurrence matrix from the rest,
which is simple since they're already different MR jobs. Perhaps you
see a nice way to structure this code.

b) will take a little work, yes, but is still straightforward. It can
be in the "common" part of the process, done after the similarity
matrix is generated.
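
Difference b) amounts to dividing the weighted sum by the sum of absolute similarities, per Sarwar et al. A minimal, self-contained sketch of that computation (hypothetical item IDs, similarity values, and ratings; not the Mahout API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the normalized item-based prediction:
//   Prediction(u,i) = sum_n sim(i,n)*rating(u,n) / sum_n |sim(i,n)|
// All data below is made up for illustration.
public class PredictSketch {

    static double predict(Map<Long, Double> simsToI, Map<Long, Double> ratingsOfU) {
        double numerator = 0.0;
        double denominator = 0.0;
        for (Map.Entry<Long, Double> e : simsToI.entrySet()) {
            Double rating = ratingsOfU.get(e.getKey());
            if (rating == null) {
                continue;                       // u has not rated item n
            }
            numerator += e.getValue() * rating; // similarity(i,n) * rating(u,n)
            denominator += Math.abs(e.getValue());
        }
        return denominator == 0.0 ? Double.NaN : numerator / denominator;
    }

    public static void main(String[] args) {
        Map<Long, Double> sims = new LinkedHashMap<>();
        sims.put(1L, 0.5);
        sims.put(2L, 0.25);
        Map<Long, Double> ratings = new LinkedHashMap<>();
        ratings.put(1L, 4.0);
        ratings.put(2L, 2.0);
        // (0.5*4 + 0.25*2) / (0.5 + 0.25)
        System.out.println(predict(sims, ratings));
    }
}
```

The un-normalized co-occurrence version is the same loop with the division dropped, which is why carrying the similarity entries through to the reducer is all that's needed.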

On Mon, Jun 21, 2010 at 9:08 AM, Sebastian Schelter
<[email protected]> wrote:
> I have had some time this weekend to take a deeper look at Sean's slides
> from Berlin Buzzwords, where he explains the math behind the distributed
> item-based recommender. I think I found a way to extend it from using
> only simple cooccurrence counts to using the standard computations of an
> item-based recommender as defined in Sarwar et al "Item-Based
> Collaborative Filtering Recommendation Algorithms"
> (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.9927&rep=rep1&type=pdf).
>
>
> I'd be happy to see someone check and validate my thoughts!
>
> If I understand the distributed recommender correctly, what it generally
> does is compute the prediction values for all users for all
> items those users have not yet rated. And the computation is done in the
> following way:
>
>  u = a user
>  i = an item not yet rated by u
>  N = all items cooccurring with i
>
>  Prediction(u,i) = sum(all n from N: cooccurrences(i,n) * rating(u,n))
>
> The formula used in the paper which is used by
> GenericItemBasedRecommender.doEstimatePreference(...) too, looks very
> similar to the one above:
>
>  u = a user
>  i = an item not yet rated by u
>  N = all items similar to i (where similarity is usually computed by
> pairwise comparison of the item-vectors of the user-item matrix)
>
>  Prediction(u,i) = sum(all n from N: similarity(i,n) * rating(u,n)) /
> sum(all n from N: abs(similarity(i,n)))
>
> There are only two differences:
>  a) instead of the cooccurrence count, certain similarity measures like
> Pearson or cosine can be used
>  b) the resulting value is normalized by the sum of the absolute
> similarities
>
> To overcome difference a), we would only need to replace the part that
> computes the cooccurrence matrix with the code from ItemSimilarityJob or
> the code introduced in MAHOUT-418. Then we could compute arbitrary
> similarity matrices and use them in the same way the cooccurrence matrix
> is currently used.
>
> Regarding difference b): from a first look at the implementation, I think
> it should be possible to transfer the necessary similarity matrix
> entries from the PartialMultiplyMapper to the
> AggregateAndRecommendReducer, to be able to compute the normalization
> value in the denominator of the formula.
>
> -sebastian
>
