[ 
https://issues.apache.org/jira/browse/MAHOUT-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884636#action_12884636
 ] 

Sean Owen commented on MAHOUT-420:
----------------------------------

Now that I'm looking at the patch I have a number of questions. It seems to be 
changing many key points of the job, and I want to make sure the functionality 
and optimizations are not being lost.

I'm not fully understanding the handling of NaN. You can see what was done 
before -- NaN values in vectors were used to exclude items from recommendation. 
It's a reasonably nice way to do it. What's the equivalent here? I see other 
bits of code paying attention to NaN.

Are we handling "boolean" preferences efficiently? Before it would avoid the 
vector-times-preference step when the pref was known to be 1.0, and I don't see 
that now.

Finally, there is a space-saving feature in the vectors that writes float 
values instead of doubles, since we don't need 64 bits of precision. I also 
don't see how that's preserved.

Basically I am not yet sure how the new computation is structured from reading 
the code. I think some comments on the "Aggregate" jobs would be ideal. 

It's also a big task to test, but my concern is how fast this runs now. I got 
to about 700 CPU-hours for 5.7 million users / 130M ratings, and I'm afraid it 
could easily go up by orders of magnitude if some of the optimizations aren't 
here.
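
To make the concerns above concrete, here is a loose sketch (hypothetical names, not the actual patch or Mahout code) of the two conventions being asked about: a NaN preference excluding an item's contribution, and skipping the multiply when the preference is known to be 1.0 in the "boolean" case:

```java
// Loose sketch of the optimizations discussed above; the class and method
// names are illustrative, not taken from the patch.
public class PartialMultiplySketch {

  // Accumulates one item's similarity column, scaled by the user's preference.
  static void addPartialProducts(double pref, double[] similarityColumn,
                                 double[] accumulator) {
    if (Double.isNaN(pref)) {
      // NaN preference marks an item to be excluded: contribute nothing
      return;
    }
    for (int i = 0; i < similarityColumn.length; i++) {
      // avoid the vector-times-preference multiply when the preference
      // is known to be 1.0 (the "boolean" preference case)
      accumulator[i] += (pref == 1.0)
          ? similarityColumn[i]
          : pref * similarityColumn[i];
    }
  }
}
```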

> Improving the distributed item-based recommender
> ------------------------------------------------
>
>                 Key: MAHOUT-420
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-420
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-420-2.patch, MAHOUT-420.patch
>
>
> A summary of the discussion on the mailing list:
> Extend the distributed item-based recommender from using only simple 
> cooccurrence counts to using the standard computations of an item-based 
> recommender as defined in Sarwar et al "Item-Based Collaborative Filtering 
> Recommendation Algorithms" 
> (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.9927&rep=rep1&type=pdf).
> What the distributed recommender generally does is compute the prediction 
> values for all users for all items those users have not yet rated. The 
> computation is done in the following way:
>  u = a user
>  i = an item not yet rated by u
>  N = all items cooccurring with i
>  Prediction(u,i) = sum(all n from N: cooccurrences(i,n) * rating(u,n))
> The formula used in the paper which is used by 
> GenericItemBasedRecommender.doEstimatePreference(...) too, looks very similar 
> to the one above:
>  u = a user
>  i = an item not yet rated by u
>  N = all items similar to i (where similarity is usually computed by 
> pairwise comparison of the item-vectors of the user-item matrix)
>  Prediction(u,i) = sum(all n from N: similarity(i,n) * rating(u,n)) / sum(all 
> n from N: abs(similarity(i,n)))
> There are only 2 differences:
>  a) instead of the cooccurrence count, certain similarity measures like 
> pearson or cosine can be used
>  b) the resulting value is normalized by the sum of the similarities
> To overcome difference a) we would only need to replace the part that 
> computes the cooccurrence matrix with the code from ItemSimilarityJob or the 
> code introduced in MAHOUT-418, then we could compute arbitrary similarity 
> matrices and use them in the same way the cooccurrence matrix is currently 
> used. We just need to separate steps up to creating the co-occurrence matrix 
> from the rest, which is simple since they're already different MR jobs. 
> Regarding difference b) from a first look at the implementation I think it 
> should be possible to transfer the necessary similarity matrix entries from 
> the PartialMultiplyMapper to the AggregateAndRecommendReducer to be able to 
> compute the normalization value in the denominator of the formula. This will 
> take a little work, yes, but is still straightforward. It can be in the 
> "common" part of the process, done after the similarity matrix is generated.
> I think work on this issue should wait until MAHOUT-418 is resolved as the 
> implementation here depends on how the pairwise similarities will be computed 
> in the future.
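
For illustration, the two prediction variants in the description -- the raw cooccurrence/similarity sum and the normalized form from the Sarwar et al. paper -- can be sketched in plain Java (plain arrays stand in for Mahout's vector and matrix classes; names are illustrative only):

```java
// Sketch of the two prediction formulas from the description.
// similarities[n] plays the role of cooccurrences(i,n) or similarity(i,n);
// ratings[n] is rating(u,n), with NaN meaning "no rating for item n".
public class PredictionSketch {

  // normalize == false: Prediction(u,i) = sum_n sim(i,n) * rating(u,n)
  // normalize == true:  the same sum divided by sum_n |sim(i,n)|
  static double predict(double[] similarities, double[] ratings,
                        boolean normalize) {
    double numerator = 0.0;
    double denominator = 0.0;
    for (int n = 0; n < similarities.length; n++) {
      if (Double.isNaN(ratings[n])) {
        continue; // no rating for this cooccurring item: skip it
      }
      numerator += similarities[n] * ratings[n];
      denominator += Math.abs(similarities[n]);
    }
    return normalize ? numerator / denominator : numerator;
  }
}
```

Difference a) only changes how the `similarities` values are produced; difference b) is the `normalize == true` branch.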

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
