[jira] [Updated] (MAHOUT-1004) Distributed User-based Collaborative Filtering

Kris Jack (JIRA) Thu, 03 May 2012 03:49:18 -0700

     [ 
https://issues.apache.org/jira/browse/MAHOUT-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Kris Jack updated MAHOUT-1004:
------------------------------

    Affects Version/s: 0.7
               Status: Patch Available  (was: Open)

This is a first attempt at refactoring the item-based code to include 
user-based functionality.  So far I have only written high-level unit tests 
that demonstrate that the whole job runs, because I think we're still at the 
stage where we need to discuss whether this refactoring is appropriate.  I 
have, though, fixed an existing unit test (RecommenderJobTest) that was 
failing because of my changes.

Here's a description of the changes...

org.apache.mahout.cf.taste.hadoop.user.RecommenderJob has been created.  It's 
similar to the item-based version of this job and the two could conceivably be 
merged together or rely upon common classes.  I've left it as a separate job 
for the moment so as not to make this already fat patch even fatter.

org.apache.mahout.cf.taste.hadoop.preparation.PreparePreferenceMatrixJob now 
has an option to skip encoding item ids as ints.  I had to remove the int-long 
mapping because it wouldn't fit in memory when I was running large-scale 
experiments.  I understand that you want to keep it for compatibility with the 
Taste code, so I've just added a flag that allows it to be skipped.

org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob can now 
be column- or row-centric, controlled by the option "outputColumns":
- VectorNormMapper has been split into 2 separate classes, 
ColumnVectorNormMapper and RowVectorNormMapper; one is needed for item-based, 
the other for user-based.  The original VectorNormMapper has been removed.
- pairwiseSimilarity operates on either the weightsPath or the input path; the 
first is needed for item-based, the second for user-based
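
To make the row/column distinction concrete, here is a minimal plain-Java 
sketch (no Hadoop or Mahout types) that counts co-occurrences either between 
the rows or between the columns of a small dense matrix.  The class name 
CooccurrenceSketch, the method cooccurrence and the byColumns flag are made up 
for illustration and are not part of the patch; the real job works on 
distributed sparse vectors.

```java
import java.util.Arrays;

final class CooccurrenceSketch {

  // Counts, for every pair of vectors (a, b), how many positions are
  // non-zero in both. With byColumns == false the vectors are the rows
  // of m; with byColumns == true they are the columns of m.
  static int[][] cooccurrence(double[][] m, boolean byColumns) {
    int rows = m.length;
    int cols = m[0].length;
    int n = byColumns ? cols : rows;   // number of vectors being compared
    int len = byColumns ? rows : cols; // length of each vector
    int[][] result = new int[n][n];
    for (int a = 0; a < n; a++) {
      for (int b = 0; b < n; b++) {
        int count = 0;
        for (int k = 0; k < len; k++) {
          double x = byColumns ? m[k][a] : m[a][k];
          double y = byColumns ? m[k][b] : m[b][k];
          if (x != 0.0 && y != 0.0) {
            count++;
          }
        }
        result[a][b] = count;
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Hypothetical 2 x 3 matrix of boolean preferences.
    double[][] prefs = {
      {1, 1, 0},
      {0, 1, 1},
    };
    System.out.println(Arrays.deepToString(cooccurrence(prefs, false)));
    System.out.println(Arrays.deepToString(cooccurrence(prefs, true)));
  }
}
```

The same input yields a 2 x 2 result when rows are compared and a 3 x 3 
result when columns are compared, which is the difference between the 
similarity matrices that the item-based and user-based jobs need.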

org.apache.mahout.cf.taste.hadoop.item.UserVectorSplitterMapper can now run as 
item-based or user-based:
- item-based is the default behaviour and does not change the item-based 
algorithm
- user-based changes what is output by the mapper, swapping the item index and 
the user id around
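
A rough sketch of the swap, in plain Java without Hadoop types.  The Emit 
class, the split method and the userBased flag are invented for illustration 
and are not the names used in the patch; the sketch only assumes that the 
mapper receives one user's preference vector at a time.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SplitterSketch {

  // One emitted (key, value) pair; the value carries an id and a preference.
  static final class Emit {
    final long key;
    final long valueId;
    final float pref;

    Emit(long key, long valueId, float pref) {
      this.key = key;
      this.valueId = valueId;
      this.pref = pref;
    }

    @Override
    public String toString() {
      return key + " -> (" + valueId + ", " + pref + ")";
    }
  }

  // Splits one user's vector (item index -> preference) into one pair per item.
  static List<Emit> split(long userId, Map<Integer, Float> userVector, boolean userBased) {
    List<Emit> out = new ArrayList<>();
    for (Map.Entry<Integer, Float> e : userVector.entrySet()) {
      int itemIndex = e.getKey();
      float pref = e.getValue();
      out.add(userBased
          ? new Emit(userId, itemIndex, pref)    // user-based: keyed on user id
          : new Emit(itemIndex, userId, pref));  // item-based: keyed on item index
    }
    return out;
  }

  public static void main(String[] args) {
    Map<Integer, Float> vector = new LinkedHashMap<>();
    vector.put(10, 4.0f);
    vector.put(11, 2.5f);
    System.out.println(split(7L, vector, false));
    System.out.println(split(7L, vector, true));
  }
}
```

The preference value is the same in both modes; only the roles of the item 
index and the user id are exchanged.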

For the user-based version I had to break the final 
PartialMultiplyMapper-AggregateAndRecommendReducer step of the item-based 
recommender into 2 MR jobs.  An extra transposition step is needed because the 
results are generated keyed on item rather than user.  The new MR jobs are:
- PartialMultiplyMapper-VectorAdditionReducer that multiplies and adds vectors 
together
- TransposeMapper-VectorToRecommendationsReducer that transposes the results 
and outputs them as RecommendedItemsWritable
As you can imagine, a lot of the code in these new MR jobs is the same as in 
the original AggregateAndRecommendReducer, so I pulled out 2 common classes, 
VectorAdditionUtils and RecommendationsWriter.  I haven't refactored 
AggregateAndRecommendReducer itself, but it would be relatively 
straightforward to eliminate the code duplication by having it call the new 
VectorAdditionUtils and RecommendationsWriter.
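
The two steps can be illustrated in memory with a small plain-Java sketch.  
It is not the patch code: the class UserBasedSketch and its methods are 
hypothetical, dense arrays stand in for Mahout's sparse vectors, and 
filtering of already-known items and top-N selection are left out.

```java
import java.util.Map;
import java.util.TreeMap;

final class UserBasedSketch {

  // Step 1 (multiply and add): for every preference of user v for item i,
  // add pref * (column v of the user-user similarity matrix) to the running
  // sum for item i. The result is keyed on item; each value holds one score
  // per user.
  static Map<Integer, double[]> multiplyAndAdd(double[][] similarity, double[][] prefs) {
    int users = similarity.length;
    Map<Integer, double[]> byItem = new TreeMap<>();
    for (int v = 0; v < users; v++) {
      for (int i = 0; i < prefs[v].length; i++) {
        double pref = prefs[v][i];
        if (pref == 0.0) {
          continue; // no preference recorded for this (user, item) pair
        }
        double[] sum = byItem.computeIfAbsent(i, k -> new double[users]);
        for (int u = 0; u < users; u++) {
          sum[u] += pref * similarity[u][v];
        }
      }
    }
    return byItem;
  }

  // Step 2 (transpose): re-key the item-keyed score vectors on user, so that
  // each user ends up with a map of item -> score.
  static Map<Integer, Map<Integer, Double>> transpose(Map<Integer, double[]> byItem) {
    Map<Integer, Map<Integer, Double>> byUser = new TreeMap<>();
    for (Map.Entry<Integer, double[]> e : byItem.entrySet()) {
      double[] scores = e.getValue();
      for (int u = 0; u < scores.length; u++) {
        if (scores[u] != 0.0) {
          byUser.computeIfAbsent(u, k -> new TreeMap<>()).put(e.getKey(), scores[u]);
        }
      }
    }
    return byUser;
  }

  public static void main(String[] args) {
    // Hypothetical data: 2 users, 2 items.
    double[][] similarity = {{1.0, 0.5}, {0.5, 1.0}}; // user-user
    double[][] prefs = {{1.0, 0.0}, {0.0, 1.0}};      // rows = users, columns = items
    System.out.println(transpose(multiplyAndAdd(similarity, prefs)));
  }
}
```

In this toy data user 0 only has a preference for item 0, yet after the 
transpose user 0 also has a score for item 1, contributed by the similar 
user 1.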

Also, so far, I have only been testing this with the co-occurrence metric.  I'm 
not sure if the weights are being calculated correctly in the user-based 
version of the RowSimilarityJob.
                
> Distributed User-based Collaborative Filtering
> ----------------------------------------------
>
>                 Key: MAHOUT-1004
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1004
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>    Affects Versions: 0.7
>            Reporter: Kris Jack
>            Assignee: Sean Owen
>            Priority: Minor
>              Labels: Recommender, User-based
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> I'd like to contribute code that implements a distributed user-based 
> collaborative filtering algorithm.
> In brief, so far I've taken the code for the existing 
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob and created a new 
> org.apache.mahout.cf.taste.hadoop.user.RecommenderJob.  With help from Sean 
> Owen, I followed a similar approach to the item-based implementation, but 
> multiplied a user-user matrix with a user-item vector rather than an 
> item-item matrix with an item-user vector.  The result of the multiplication 
> then needs to be transposed in order to output recommendations by user id.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
