[
https://issues.apache.org/jira/browse/MAHOUT-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kris Jack updated MAHOUT-1004:
------------------------------
Affects Version/s: 0.7
Status: Patch Available (was: Open)
This is a first attempt at refactoring the item-based code to include
user-based functionality. So far I have only written high-level unit tests that
demonstrate that the whole job runs, because I think we still need to discuss
whether this refactoring is appropriate. I have also fixed an existing unit
test (RecommenderJobTest) that was failing because of my changes.
Here's a description of the changes...
org.apache.mahout.cf.taste.hadoop.user.RecommenderJob has been created. It's
similar to the item-based version of this job and the two could conceivably be
merged together or rely upon common classes. I've left it as a separate job
for the moment so as not to make this already fat patch even fatter.
org.apache.mahout.cf.taste.hadoop.preparation.PreparePreferenceMatrixJob now
has an option to skip encoding item ids as ints. I had to remove the int-long
mapping because it wouldn't fit in memory when I was running large-scale
experiments. I understand that you want to keep it so that the job stays
compatible with the Taste code, so I've just added a flag that allows it to be
skipped.
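
To illustrate why the int-long mapping costs memory, here is a minimal sketch in Python. It assumes the long ids are folded into non-negative 32-bit indexes by a simple hash, which is my understanding of the general approach and has not been checked against the patch. The function names are hypothetical and the sketch only handles non-negative ids.

```python
def id_to_index(long_id):
    """Fold a non-negative 64-bit id into a non-negative 32-bit index."""
    folded = (long_id ^ (long_id >> 32)) & 0xFFFFFFFF
    return folded & 0x7FFFFFFF


def build_index_map(long_ids):
    """Reverse mapping from index to original id.

    It holds one entry per distinct index, so it grows with the number
    of items. Skipping the encoding avoids building it at all.
    """
    return {id_to_index(long_id): long_id for long_id in long_ids}
```

Two different ids can fold to the same index (here 4 and 2**32 + 5 both give 4), which is why the reverse mapping is needed to recover the original ids.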
org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob can now
be column-centric or row-centric, controlled by the option "outputColumns":
- VectorNormMapper has been removed and replaced by two separate classes,
ColumnVectorNormMapper and RowVectorNormMapper. One is needed for item-based,
the other for user-based.
- pairwiseSimilarity operates on either the weightsPath or the input path, the
first being needed for item-based, the second for user-based.
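
The difference between the row-centric and column-centric modes can be sketched as follows. This is a small in-memory Python illustration, not the MapReduce code: the function name and the output_columns parameter are hypothetical stand-ins for the "outputColumns" option, and only the co-occurrence count is computed.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_cooccurrence(matrix, output_columns=False):
    """Count shared non-zero entries between pairs of rows.

    matrix maps row id -> {column id: value}. With output_columns=True
    the matrix is transposed first, so pairs of columns are compared.
    """
    if output_columns:
        transposed = defaultdict(dict)
        for row, cols in matrix.items():
            for col, value in cols.items():
                transposed[col][row] = value
        matrix = transposed
    counts = {}
    for a, b in combinations(sorted(matrix), 2):
        shared = len(matrix[a].keys() & matrix[b].keys())
        if shared:
            counts[(a, b)] = shared
    return counts
```

With an item-by-user matrix as input, the default mode gives item-item co-occurrences and output_columns=True gives user-user co-occurrences.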
org.apache.mahout.cf.taste.hadoop.item.UserVectorSplitterMapper can now run as
item-based or user-based:
- item-based is the default behaviour and does not change the item-based
algorithm
- user-based changes what is output by the mapper, swapping the item index and
the user id around
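
As a rough Python sketch of the two modes, assuming the mapper emits one (key, value) pair per preference. The function name and the user_based parameter are hypothetical, and this reflects my reading of the description above rather than the patch itself.

```python
def split_user_vector(user_id, prefs_by_item, user_based=False):
    """Emit one (key, value) pair per preference in a user vector.

    Item-based (default): keyed on item index, value carries the user id.
    User-based: the item index and the user id swap places.
    """
    for item_index, pref in prefs_by_item.items():
        if user_based:
            yield user_id, (item_index, pref)
        else:
            yield item_index, (user_id, pref)
```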
For the user-based version, I had to split the final
PartialMultiplyMapper-AggregateAndRecommendReducer step of the item-based
recommender into 2 MR jobs. An extra transposition step is needed because the
results are generated keyed on item rather than user. The new MR jobs are:
- PartialMultiplyMapper-VectorAdditionReducer that multiplies and adds vectors
together
- TransposeMapper-VectorToRecommendationsReducer that transposes the results
and outputs them as RecommendedItemsWritable
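
The two steps can be sketched in memory as follows. This is a simplified Python illustration under stated assumptions, not the MR code: the function names are hypothetical, similarities and preferences are plain dicts, the output is a list of (item, score) pairs standing in for RecommendedItemsWritable, and filtering out items a user has already rated is omitted.

```python
from collections import defaultdict


def multiply_and_add(user_similarity, prefs_by_item):
    """Step 1: for each item, weight each rater's similarity row by the
    rater's preference and add the results. Output is keyed on item."""
    scores_by_item = {}
    for item, prefs in prefs_by_item.items():
        acc = defaultdict(float)
        for rater, pref in prefs.items():
            for other, sim in user_similarity.get(rater, {}).items():
                acc[other] += sim * pref
        scores_by_item[item] = dict(acc)
    return scores_by_item


def transpose_and_recommend(scores_by_item, top_n):
    """Step 2: re-key the scores on user, then keep the top_n items."""
    by_user = defaultdict(dict)
    for item, scores in scores_by_item.items():
        for user, score in scores.items():
            by_user[user][item] = score
    return {
        user: sorted(items.items(), key=lambda p: (-p[1], p[0]))[:top_n]
        for user, items in by_user.items()
    }
```

The first function alone produces scores grouped by item, which is why the second, transposing step is needed before recommendations can be written out per user.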
As you can imagine, though, much of the code in these new MR jobs is the same
as in the original AggregateAndRecommendReducer, so I pulled out 2 common
classes, naming them VectorAdditionUtils and RecommendationsWriter. I haven't
refactored AggregateAndRecommendReducer itself, but it would be relatively
straightforward to eliminate the code duplication by having it call the new
VectorAdditionUtils and RecommendationsWriter.
Also, so far, I have only been testing this with the co-occurrence metric. I'm
not sure if the weights are being calculated correctly in the user-based
version of the RowSimilarityJob.
> Distributed User-based Collaborative Filtering
> ----------------------------------------------
>
> Key: MAHOUT-1004
> URL: https://issues.apache.org/jira/browse/MAHOUT-1004
> Project: Mahout
> Issue Type: New Feature
> Components: Collaborative Filtering
> Affects Versions: 0.7
> Reporter: Kris Jack
> Assignee: Sean Owen
> Priority: Minor
> Labels: Recommender, User-based
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> I'd like to contribute code that implements a distributed user-based
> collaborative filtering algorithm.
> In brief, so far I've taken the code for the existing
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob and created a new
> org.apache.mahout.cf.taste.hadoop.user.RecommenderJob. With help from Sean
> Owen, I followed a similar approach to the item-based implementation, but
> multiplied a user-user matrix with a user-item vector rather than an
> item-item matrix with an item-user vector. The result of the multiplication
> then needs to be transposed in order to output recommendations by user id.