[
https://issues.apache.org/jira/browse/MAHOUT-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen updated MAHOUT-420:
-----------------------------
Status: Resolved (was: Patch Available)
Assignee: Sean Owen
Fix Version/s: 0.4
Resolution: Fixed
> Improving the distributed item-based recommender
> ------------------------------------------------
>
> Key: MAHOUT-420
> URL: https://issues.apache.org/jira/browse/MAHOUT-420
> Project: Mahout
> Issue Type: Improvement
> Components: Collaborative Filtering
> Reporter: Sebastian Schelter
> Assignee: Sean Owen
> Fix For: 0.4
>
> Attachments: MAHOUT-420-2.patch, MAHOUT-420-2a.patch,
> MAHOUT-420-3.patch, MAHOUT-420.patch
>
>
> A summary of the discussion on the mailing list:
> Extend the distributed item-based recommender from using only simple
> cooccurrence counts to using the standard computations of an item-based
> recommender as defined in Sarwar et al "Item-Based Collaborative Filtering
> Recommendation Algorithms"
> (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.9927&rep=rep1&type=pdf).
> What the distributed recommender generally does is that it computes the
> prediction values for all users towards all items those users have not rated
> yet. And the computation is done in the following way:
> u = a user
> i = an item not yet rated by u
> N = all items cooccurring with i
> Prediction(u,i) = sum(all n from N: cooccurrences(i,n) * rating(u,n))
> The formula used in the paper which is used by
> GenericItemBasedRecommender.doEstimatePreference(...) too, looks very similar
> to the one above:
> u = a user
> i = an item not yet rated by u
> N = all items similar to i (where similarity is usually computed by
> pairwisely comparing the item-vectors of the user-item matrix)
> Prediction(u,i) = sum(all n from N: similarity(i,n) * rating(u,n)) / sum(all
> n from N: abs(similarity(i,n)))
> There are only 2 differences:
> a) instead of the cooccurrence count, certain similarity measures like
> pearson or cosine can be used
> b) the resulting value is normalized by the sum of the similarities
> To overcome difference a) we would only need to replace the part that
> computes the cooccurrence matrix with the code from ItemSimilarityJob or the
> code introduced in MAHOUT-418, then we could compute arbitrary similarity
> matrices and use them in the same way the cooccurrence matrix is currently
> used. We just need to separate steps up to creating the co-occurrence matrix
> from the rest, which is simple since they're already different MR jobs.
> Regarding difference b) from a first look at the implementation I think it
> should be possible to transfer the necessary similarity matrix entries from
> the PartialMultiplyMapper to the AggregateAndRecommendReducer to be able to
> compute the normalization value in the denominator of the formula. This will
> take a little work, yes, but is still straightforward. It canbe in the
> "common" part of the process, done after the similarity matrix is generated.
> I think work on this issue should wait until MAHOUT-418 is resolved as the
> implementation here depends on how the pairwise similarities will be computed
> in the future.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.