[CONF] Apache Mahout > Itembased Collaborative Filtering

confluence Thu, 14 Oct 2010 13:03:26 -0700

Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Itembased Collaborative Filtering 
(https://cwiki.apache.org/confluence/display/MAHOUT/Itembased+Collaborative+Filtering)



Edited by Sebastian Schelter:
---------------------------------------------------------------------
Itembased Collaborative Filtering is a popular way of doing Recommendation 
Mining.

h3. Terminology

We have *users* that interact with *items* (which can be almost anything: 
books, videos, news, other users, ...). Users express *preferences* towards 
items, which can be either boolean (modelling only that a user likes an item) 
or numeric (a rating value assigned to the preference). Typically, only a 
small number of preferences is known for each user.

h3. Algorithmic problems

Collaborative Filtering algorithms aim to solve the *prediction* problem where 
the task is to estimate the preference of a user towards an item which he/she 
has not yet seen.

Once an algorithm can predict preferences it can also be used to do 
*Top-N-Recommendation* where the task is to find the N items a given user might 
like best. This is usually done by isolating a set of candidate items, 
computing the predicted preferences of the given user towards them and 
returning the highest scoring ones.
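
The Top-N procedure described above can be sketched roughly as follows. This is only an illustration in Python, not Mahout code: the function {{top_n}}, the toy data and the lookup table standing in for a real predictor are all hypothetical.

```python
import heapq

def top_n(user_prefs, all_items, predict, n):
    """Return the n unseen items with the highest predicted preference."""
    # Candidate items: everything the user has no known preference for.
    candidates = [item for item in all_items if item not in user_prefs]
    # Score each candidate with the supplied prediction function.
    scored = [(predict(item), item) for item in candidates]
    # Keep only the n highest-scoring candidates, best first.
    best = heapq.nlargest(n, scored)
    return [(item, score) for score, item in best]

# Toy data (assumption): the user already has preferences for items 1 and 2.
user_prefs = {1: 5.0, 2: 3.0}
all_items = [1, 2, 3, 4, 5]
# Hypothetical stand-in for a real predictor: a fixed table of predictions.
toy_predictions = {3: 4.2, 4: 2.5, 5: 4.8}

recommendations = top_n(user_prefs, all_items, toy_predictions.get, 2)
print(recommendations)  # [(5, 4.8), (3, 4.2)]
```

In practice the candidate set is usually much smaller than the set of all unseen items; how it is isolated is a design choice that this sketch leaves out.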

If we look at the problem from a mathematical perspective, a *user-item-matrix* 
is created from the preference data and the task is to predict the missing 
entries by finding patterns in the known entries.
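
As a small illustration of this view, the following Python sketch builds a dense user-item-matrix from preference triples, using {{None}} for the missing entries that a prediction algorithm would try to fill. The function name and the toy data are assumptions made for this example; a real system would use a sparse representation.

```python
def build_user_item_matrix(preferences):
    """Turn (userID, itemID, value) triples into a dense matrix.

    Rows correspond to users, columns to items; unknown preferences
    are stored as None.
    """
    users = sorted({user for user, _, _ in preferences})
    items = sorted({item for _, item, _ in preferences})
    row_of = {user: r for r, user in enumerate(users)}
    col_of = {item: c for c, item in enumerate(items)}
    matrix = [[None] * len(items) for _ in users]
    for user, item, value in preferences:
        matrix[row_of[user]][col_of[item]] = value
    return users, items, matrix

# Toy preference data (assumption): three users, three items.
preferences = [(1, 10, 5.0), (1, 11, 3.0), (2, 10, 4.0), (3, 12, 1.0)]
users, items, matrix = build_user_item_matrix(preferences)
for user, row in zip(users, matrix):
    print(user, row)
```

Each column of this matrix is an item-vector, and most of its entries are unknown.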

h3. Itembased Collaborative Filtering

A popular approach called "Itembased Collaborative Filtering" estimates a 
user's preference towards an item by looking at his/her preferences towards 
similar items. Note that in this context similarity means similarity of 
rating behaviour, not similarity of content.

The standard procedure is to compare the columns of the user-item-matrix (the 
item-vectors) pairwise, using a similarity measure such as Pearson 
correlation, cosine or log-likelihood, to find similar items. These are then 
used together with the user's ratings to predict his/her preference towards 
unknown items.
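
A minimal Python sketch of this procedure, assuming cosine similarity over item-vectors (with missing ratings treated as zero) and a similarity-weighted average as the prediction rule. Both choices are simplifications picked for illustration and are not necessarily what Mahout computes; all names and data are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two item-vectors stored as {userID: rating}."""
    common = set(a) & set(b)
    dot = sum(a[user] * b[user] for user in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def predict(user, target, item_vectors):
    """Similarity-weighted average of the user's ratings on other items."""
    weighted_sum = 0.0
    weight_total = 0.0
    for item, vector in item_vectors.items():
        if item == target or user not in vector:
            continue  # only items the user has actually rated contribute
        similarity = cosine(item_vectors[target], vector)
        weighted_sum += similarity * vector[user]
        weight_total += abs(similarity)
    return weighted_sum / weight_total if weight_total else None

# Toy item-vectors (assumption): itemID -> {userID: rating}.
item_vectors = {
    10: {1: 5.0, 2: 4.0},
    11: {1: 5.0, 2: 4.0, 3: 2.0},
    12: {1: 1.0, 3: 5.0},
}

# User 3 has not rated item 10; item 11 is rated much like item 10,
# so the prediction is pulled towards user 3's rating of item 11.
prediction = predict(3, 10, item_vectors)
print(prediction)
```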


h3. Map/Reduce implementations

Mahout offers two Map/Reduce jobs that support Itembased Collaborative 
Filtering.

*org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob* computes 
the pairwise similarities between items. It expects a .csv file with the 
preference data as input, where each line represents a single preference in 
the form _userID,itemID,value_, and outputs pairs of itemIDs with their 
associated similarity value.

{code}
  --input (-i) input                                        Path to job input directory.
  --output (-o) output                                      The directory pathname for output.
  --similarityClassname (-s) similarityClassname            Name of distributed similarity class to instantiate,
                                                            alternatively use one of the predefined similarities
                                                            (SIMILARITY_COOCCURRENCE, SIMILARITY_EUCLIDEAN_DISTANCE,
                                                            SIMILARITY_LOGLIKELIHOOD, SIMILARITY_PEARSON_CORRELATION,
                                                            SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_UNCENTERED_COSINE,
                                                            SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE)
  --maxSimilaritiesPerItem (-m) maxSimilaritiesPerItem      try to cap the number of similar items per item to this
                                                            number (default: 100)
  --maxCooccurrencesPerItem (-o) maxCooccurrencesPerItem    try to cap the number of cooccurrences per item to this
                                                            number (default: 100)
  --booleanData (-b) booleanData                            Treat input as without pref values
{code}
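
To make the input format concrete, here is a small Python sketch that parses lines of the form _userID,itemID,value_. It is not part of Mahout. Integer IDs, the function name {{parse_preferences}} and the choice to use 1.0 as the value when the input is treated as boolean data are all assumptions made for this illustration.

```python
import csv
import io

def parse_preferences(lines, boolean_data=False):
    """Parse userID,itemID,value lines into (user, item, value) tuples.

    With boolean_data=True any value column is ignored and every
    preference gets the value 1.0 (an assumption of this sketch).
    """
    preferences = []
    for row in csv.reader(lines):
        if not row:
            continue  # skip empty lines
        user_id, item_id = int(row[0]), int(row[1])
        value = 1.0 if boolean_data else float(row[2])
        preferences.append((user_id, item_id, value))
    return preferences

# Numeric preferences, one per line.
numeric_input = io.StringIO("1,10,5.0\n1,11,3.0\n2,10,4.0\n")
prefs = parse_preferences(numeric_input)

# Boolean preferences: only userID,itemID is given.
boolean_input = io.StringIO("1,10\n2,11\n")
boolean_prefs = parse_preferences(boolean_input, boolean_data=True)

print(prefs)
print(boolean_prefs)
```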

*org.apache.mahout.cf.taste.hadoop.item.RecommenderJob* is a completely 
distributed itembased recommender. It expects a .csv file with the preference 
data as input, where each line represents a single preference in the form 
_userID,itemID,value_ and outputs userIDs with associated recommended itemIDs 
and their scores.

{code}
  --input (-i) input                                        Path to job input directory.
  --output (-o) output                                      The directory pathname for output.
  --numRecommendations (-n) numRecommendations              Number of recommendations per user
  --usersFile (-u) usersFile                                File of users to recommend for
  --itemsFile (-i) itemsFile                                File of items to recommend for
  --filterFile (-f) filterFile                              File containing comma-separated userID,itemID pairs. Used to
                                                            exclude the item from the recommendations for that user
                                                            (optional)
  --booleanData (-b) booleanData                            Treat input as without pref values
  --maxPrefsPerUser maxPrefsPerUser                         Maximum number of preferences considered per user in final
                                                            recommendation phase
  --maxSimilaritiesPerItem maxSimilaritiesPerItem           Maximum number of similarities considered per item
  --maxCooccurrencesPerItem (-o) maxCooccurrencesPerItem    try to cap the number of cooccurrences per item to this
                                                            number (default: 100)
  --similarityClassname (-s) similarityClassname            Name of distributed similarity class to instantiate,
                                                            alternatively use one of the predefined similarities
                                                            (SIMILARITY_COOCCURRENCE, SIMILARITY_EUCLIDEAN_DISTANCE,
                                                            SIMILARITY_LOGLIKELIHOOD, SIMILARITY_PEARSON_CORRELATION,
                                                            SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_UNCENTERED_COSINE,
                                                            SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE)
{code}
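
The effect of the {{--filterFile}} option, as described in the listing above, can be pictured with the following Python sketch. It only illustrates the stated semantics (excluding given userID,itemID pairs from the recommendations) and says nothing about how the job applies the filter internally. The function name and data are hypothetical.

```python
def apply_filter(recommendations, filter_pairs):
    """Drop every (userID, itemID) pair listed in filter_pairs.

    recommendations maps userID -> list of (itemID, score) tuples.
    """
    excluded = set(filter_pairs)
    return {
        user: [(item, score) for item, score in recs
               if (user, item) not in excluded]
        for user, recs in recommendations.items()
    }

# Toy recommendations (assumption): userID -> [(itemID, score), ...].
recommendations = {1: [(10, 4.5), (11, 4.0)], 2: [(10, 3.9)]}
# Item 10 must not be recommended to user 1; user 2 is unaffected.
filter_pairs = [(1, 10)]

filtered = apply_filter(recommendations, filter_pairs)
print(filtered)  # {1: [(11, 4.0)], 2: [(10, 3.9)]}
```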

TODO: add more details

h3. Resources

* [Sarwar et al.: Item-Based Collaborative Filtering Recommendation Algorithms|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.9927&rep=rep1&type=pdf]
* [Slides: Distributed Itembased Collaborative Filtering with Apache Mahout|http://www.slideshare.net/sscdotopen/mahoutcf]
