[ 
https://issues.apache.org/jira/browse/MAHOUT-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885563#action_12885563
 ] 

Sebastian Schelter commented on MAHOUT-420:
-------------------------------------------

I did some local tests using the 100K MovieLens dataset, generating 10 
recommendations per user, with maxPrefsPerUserConsidered and 
maxCooccurrencesPerItemConsidered/maxSimilaritiesPerItemConsidered both set to 25. 
I checked the overall running time and the amount of data read and written in 
the partialMultiply and aggregateAndRecommend jobs.

The simple cooccurrence-based recommender finished in approximately one minute 
and read and wrote about 200MB in the partialMultiply and aggregateAndRecommend 
jobs. All of my patches needed about 6 minutes and read and wrote 3-4 times as 
much data. I finally figured out that this huge difference was caused by my not 
pruning the vectors, as was done before in UserVectorToCooccurrenceMapper.

I added that step and evolved the latest patch (the one that uses vectors 
instead of custom writables).

With that change it also finishes in about one minute and writes about 400MB in 
the partialMultiply and 300MB in the aggregateAndRecommend step when the 
computation uses Pearson correlation as the similarity measure. I applied all 
the optimizations you mentioned (setWritesLaxPrecision(true) on the 
VectorWritables, skipping the multiplication when the preference is 1, and a 
special computation method for boolean data). I also found a way to make the 
patch drop recommendations based on only one data point (the same thing 
GenericItemBasedRecommender.doEstimatePreference(...) does).

Are we on the right path and do you see more optimization potential?
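To make the pruning step concrete: the idea is to cap each user vector at its 
strongest entries before the cooccurrence/similarity computation, as 
UserVectorToCooccurrenceMapper did. Here is a minimal standalone sketch; the 
class and method names are illustrative only, not Mahout's actual API.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch: prune a user's preference vector down to its
// maxPrefs strongest entries (by absolute preference value) before
// feeding it into the cooccurrence/similarity computation.
public class PrunePrefs {

    // itemIds[i] corresponds to prefs[i]; returns the ids of the kept items.
    static long[] prune(long[] itemIds, double[] prefs, int maxPrefs) {
        Integer[] idx = new Integer[itemIds.length];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = i;
        }
        // sort indices by descending absolute preference strength
        Arrays.sort(idx, Comparator.comparingDouble(i -> -Math.abs(prefs[i])));
        int keep = Math.min(maxPrefs, itemIds.length);
        long[] kept = new long[keep];
        for (int i = 0; i < keep; i++) {
            kept[i] = itemIds[idx[i]];
        }
        return kept;
    }

    public static void main(String[] args) {
        long[] items = {10, 20, 30, 40};
        double[] prefs = {0.5, 4.0, -3.0, 1.0};
        // keeps the two strongest preferences (items 20 and 30)
        System.out.println(Arrays.toString(prune(items, prefs, 2)));
    }
}
```

With maxPrefsPerUserConsidered = 25 this bounds the number of item pairs each 
user contributes, which is what keeps the partialMultiply data volume down.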

> Improving the distributed item-based recommender
> ------------------------------------------------
>
>                 Key: MAHOUT-420
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-420
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-420-2.patch, MAHOUT-420-2a.patch, 
> MAHOUT-420-3.patch, MAHOUT-420.patch
>
>
> A summary of the discussion on the mailing list:
> Extend the distributed item-based recommender from using only simple 
> cooccurrence counts to using the standard computations of an item-based 
> recommender as defined in Sarwar et al "Item-Based Collaborative Filtering 
> Recommendation Algorithms" 
> (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.9927&rep=rep1&type=pdf).
> The distributed recommender computes prediction values for each user for all 
> items that user has not yet rated. The computation works as follows:
>  u = a user
>  i = an item not yet rated by u
>  N = all items cooccurring with i
>  Prediction(u,i) = sum(all n from N: cooccurrences(i,n) * rating(u,n))
> The formula used in the paper which is used by 
> GenericItemBasedRecommender.doEstimatePreference(...) too, looks very similar 
> to the one above:
>  u = a user
>  i = an item not yet rated by u
>  N = all items similar to i (where similarity is usually computed by pairwise 
> comparison of the item vectors of the user-item matrix)
>  Prediction(u,i) = sum(all n from N: similarity(i,n) * rating(u,n)) / sum(all 
> n from N: abs(similarity(i,n)))
> There are only 2 differences:
>  a) instead of the cooccurrence count, certain similarity measures like 
> pearson or cosine can be used
>  b) the resulting value is normalized by the sum of the similarities
> To overcome difference a) we would only need to replace the part that 
> computes the cooccurrence matrix with the code from ItemSimilarityJob or the 
> code introduced in MAHOUT-418, then we could compute arbitrary similarity 
> matrices and use them in the same way the cooccurrence matrix is currently 
> used. We just need to separate steps up to creating the co-occurrence matrix 
> from the rest, which is simple since they're already different MR jobs. 
> Regarding difference b) from a first look at the implementation I think it 
> should be possible to transfer the necessary similarity matrix entries from 
> the PartialMultiplyMapper to the AggregateAndRecommendReducer to be able to 
> compute the normalization value in the denominator of the formula. This will 
> take a little work, but it is still straightforward. It can be done in the 
> "common" part of the process, after the similarity matrix is generated.
> I think work on this issue should wait until MAHOUT-418 is resolved as the 
> implementation here depends on how the pairwise similarities will be computed 
> in the future.
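
The two prediction rules quoted above can be put side by side in a small 
sketch. This is plain illustrative Java, not Mahout's classes; similarity(i,n) 
stands in for either a cooccurrence count or, e.g., a Pearson similarity.

```java
// Illustrative sketch of the two prediction formulas discussed in the issue.
// Arrays sims and ratings hold similarity(i,n) and rating(u,n) for all n in N.
public class PredictionSketch {

    // Unnormalized form used by the cooccurrence-based recommender:
    //   Prediction(u,i) = sum(all n from N: cooccurrences(i,n) * rating(u,n))
    static double unnormalized(double[] sims, double[] ratings) {
        double sum = 0.0;
        for (int n = 0; n < sims.length; n++) {
            sum += sims[n] * ratings[n];
        }
        return sum;
    }

    // Normalized form from Sarwar et al., as in
    // GenericItemBasedRecommender.doEstimatePreference(...):
    //   Prediction(u,i) = sum(sim(i,n) * rating(u,n)) / sum(abs(sim(i,n)))
    static double normalized(double[] sims, double[] ratings) {
        double numerator = 0.0;
        double denominator = 0.0;
        for (int n = 0; n < sims.length; n++) {
            numerator += sims[n] * ratings[n];
            denominator += Math.abs(sims[n]);
        }
        return numerator / denominator;
    }

    public static void main(String[] args) {
        double[] sims = {0.8, 0.5, -0.2};
        double[] ratings = {4.0, 3.0, 5.0};
        System.out.println(unnormalized(sims, ratings)); // ~3.7
        System.out.println(normalized(sims, ratings));   // ~3.7 / 1.5
    }
}
```

Difference b) is exactly the extra denominator in normalized(...): it is why 
the similarity entries themselves must reach the AggregateAndRecommendReducer, 
not just the partial products.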

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
