Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Itembased Collaborative Filtering
(https://cwiki.apache.org/confluence/display/MAHOUT/Itembased+Collaborative+Filtering)
Edited by Sebastian Schelter:
---------------------------------------------------------------------
Itembased Collaborative Filtering is a popular way of doing Recommendation
Mining.
h3. Terminology
We have *users* that interact with *items* (which can be almost anything:
books, videos, news, other users, ...). Users express *preferences* towards
items, which can be either boolean (modelling only that a user likes an item)
or numeric (a rating value assigned to the preference). Typically only a small
number of preferences is known for each user.
h3. Algorithmic problems
Collaborative Filtering algorithms aim to solve the *prediction* problem where
the task is to estimate the preference of a user towards an item which he/she
has not yet seen.
Once an algorithm can predict preferences it can also be used to do
*Top-N-Recommendation* where the task is to find the N items a given user might
like best. This is usually done by isolating a set of candidate items,
computing the predicted preferences of the given user towards them and
returning the highest scoring ones.
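The Top-N procedure described above can be sketched in a few lines of Python. This is an illustration only, not Mahout code: the function name {{top_n}}, the {{predict}} callback and the fixed scores in {{popularity}} are all hypothetical stand-ins for a real preference estimator.

```python
import heapq


def top_n(user_prefs, all_items, predict, n):
    """Return the n candidate items with the highest predicted preference."""
    # Candidates: items the user has not expressed a preference for yet.
    candidates = [item for item in all_items if item not in user_prefs]
    # Score every candidate with the supplied prediction function.
    scored = [(predict(user_prefs, item), item) for item in candidates]
    # Keep the n highest scores; ties fall back to comparing item IDs.
    return [(item, score) for score, item in heapq.nlargest(n, scored)]


# Hypothetical stand-in predictor: a fixed score per item.
popularity = {"a": 4.0, "b": 2.5, "c": 4.5, "d": 3.0}


def predict(user_prefs, item):
    return popularity[item]


# The user already knows item "a", so only "b", "c" and "d" are candidates.
recs = top_n({"a": 5.0}, popularity.keys(), predict, n=2)
print(recs)
```

Any estimator with the same call signature could be plugged in for {{predict}}.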
If we look at the problem from a mathematical perspective, a *user-item-matrix*
is created from the preference data and the task is to predict the missing
entries by finding patterns in the known entries.
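As a small illustration of the user-item-matrix, the following Python sketch builds a dense matrix from _userID,itemID,value_ triples and marks unknown preferences with {{None}}. The helper name {{build_user_item_matrix}} and the sample data are made up for this example; real data sets are far too sparse for a dense representation.

```python
def build_user_item_matrix(preferences):
    """Build a dense user-item matrix; unknown preferences stay None."""
    users = sorted({user for user, _, _ in preferences})
    items = sorted({item for _, item, _ in preferences})
    row = {user: k for k, user in enumerate(users)}
    col = {item: j for j, item in enumerate(items)}
    matrix = [[None] * len(items) for _ in users]
    for user, item, value in preferences:
        matrix[row[user]][col[item]] = value
    return users, items, matrix


# Made-up sample preferences as (userID, itemID, value) triples.
prefs = [(1, 10, 5.0), (1, 11, 3.0), (2, 10, 4.0), (3, 12, 1.0)]
users, items, matrix = build_user_item_matrix(prefs)

for user, ratings in zip(users, matrix):
    print(user, ratings)
```

The {{None}} entries are the missing values that the prediction task tries to fill in.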
h3. Itembased Collaborative Filtering
A popular approach called "Itembased Collaborative Filtering" estimates a
user's preference towards an item by looking at his/her preferences towards
similar items. Be aware that in this context similarity means similarity of
rating behaviour, not similarity of content.
The standard procedure is to compare the columns of the user-item-matrix (the
item-vectors) pairwise using a similarity measure such as Pearson correlation,
cosine or loglikelihood. The resulting similar items are then used together
with the user's ratings to predict his/her preference towards unknown items.
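The procedure can be sketched in plain Python under two assumptions that are not taken from the Mahout implementation: similarity is a cosine that treats missing ratings as zero, and the prediction is a similarity-weighted average of the user's own ratings. All function names and the sample data are hypothetical.

```python
import math


def item_vectors(preferences):
    """Group preferences by item: itemID -> {userID: value} (a matrix column)."""
    vectors = {}
    for user, item, value in preferences:
        vectors.setdefault(item, {})[user] = value
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse item vectors; missing entries count as 0."""
    dot = sum(a[u] * b[u] for u in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def predict(preferences, user, target_item):
    """Similarity-weighted average of the user's ratings for other items."""
    vectors = item_vectors(preferences)
    target = vectors.get(target_item, {})
    num = den = 0.0
    for item, vec in vectors.items():
        if item == target_item or user not in vec:
            continue  # only items the user has actually rated contribute
        sim = cosine(target, vec)
        num += sim * vec[user]
        den += abs(sim)
    return num / den if den else None  # None: no usable similar item


# Made-up data: items "x" and "y" are rated alike by users 1 and 2.
prefs = [(1, "x", 5.0), (1, "y", 5.0), (2, "x", 1.0), (2, "y", 1.0), (3, "x", 4.0)]
estimate = predict(prefs, user=3, target_item="y")
print(estimate)
```

Because user 3 has rated only one other item, the estimate for "y" equals that user's rating of "x"; with more rated items the more similar ones would dominate the average.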
h3. Map/Reduce implementations
Mahout offers two Map/Reduce jobs that support Itembased Collaborative
Filtering.
*org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob* computes
the pairwise similarities between items. It expects a .csv file with the
preference data as input, where each line represents a single preference in
the form _userID,itemID,value_, and outputs pairs of itemIDs with their
associated similarity value.
{code}
--input (-i) input                                      Path to job input directory.
--output (-o) output                                    The directory pathname for output.
--similarityClassname (-s) similarityClassname          Name of distributed similarity class to
                                                        instantiate, alternatively use one of the
                                                        predefined similarities
                                                        (SIMILARITY_COOCCURRENCE,
                                                        SIMILARITY_EUCLIDEAN_DISTANCE,
                                                        SIMILARITY_LOGLIKELIHOOD,
                                                        SIMILARITY_PEARSON_CORRELATION,
                                                        SIMILARITY_TANIMOTO_COEFFICIENT,
                                                        SIMILARITY_UNCENTERED_COSINE,
                                                        SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE)
--maxSimilaritiesPerItem (-m) maxSimilaritiesPerItem    try to cap the number of similar items per
                                                        item to this number (default: 100)
--maxCooccurrencesPerItem (-o) maxCooccurrencesPerItem  try to cap the number of cooccurrences per
                                                        item to this number (default: 100)
--booleanData (-b) booleanData                          Treat input as without pref values
{code}
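As a rough sketch of how the input for this job could be prepared, the Python snippet below writes preferences in the _userID,itemID,value_ format and assembles an argument list from options shown in the listing above. The helper names, paths and sample data are hypothetical, and how the job is launched (for example through the Hadoop command line) depends on the installation and is not covered here.

```python
import csv
import io

JOB_CLASS = "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob"


def write_preferences(preferences, out):
    """Write one 'userID,itemID,value' line per preference."""
    writer = csv.writer(out, lineterminator="\n")
    for user_id, item_id, value in preferences:
        writer.writerow([user_id, item_id, value])


def item_similarity_args(input_path, output_path, similarity, max_similarities=None):
    """Assemble the job's argument list from options in the listing above."""
    args = [
        JOB_CLASS,
        "--input", input_path,
        "--output", output_path,
        "--similarityClassname", similarity,
    ]
    if max_similarities is not None:
        args += ["--maxSimilaritiesPerItem", str(max_similarities)]
    return args


# Made-up preferences, written to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_preferences([(1, 10, 5.0), (1, 11, 3.0), (2, 10, 4.0)], buffer)
print(buffer.getvalue(), end="")

# Hypothetical paths; the similarity name is one of the predefined values.
args = item_similarity_args(
    "/path/to/prefs", "/path/to/similarities", "SIMILARITY_LOGLIKELIHOOD", 50
)
print(" ".join(args))
```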
*org.apache.mahout.cf.taste.hadoop.item.RecommenderJob* is a completely
distributed itembased recommender. It expects a .csv file with the preference
data as input, where each line represents a single preference in the form
_userID,itemID,value_ and outputs userIDs with associated recommended itemIDs
and their scores.
{code}
--input (-i) input                                      Path to job input directory.
--output (-o) output                                    The directory pathname for output.
--numRecommendations (-n) numRecommendations            Number of recommendations per user
--usersFile (-u) usersFile                              File of users to recommend for
--itemsFile (-i) itemsFile                              File of items to recommend for
--filterFile (-f) filterFile                            File containing comma-separated
                                                        userID,itemID pairs. Used to exclude the
                                                        item from the recommendations for that
                                                        user (optional)
--booleanData (-b) booleanData                          Treat input as without pref values
--maxPrefsPerUser maxPrefsPerUser                       Maximum number of preferences considered
                                                        per user in final recommendation phase
--maxSimilaritiesPerItem maxSimilaritiesPerItem         Maximum number of similarities considered
                                                        per item
--maxCooccurrencesPerItem (-o) maxCooccurrencesPerItem  try to cap the number of cooccurrences per
                                                        item to this number (default: 100)
--similarityClassname (-s) similarityClassname          Name of distributed similarity class to
                                                        instantiate, alternatively use one of the
                                                        predefined similarities
                                                        (SIMILARITY_COOCCURRENCE,
                                                        SIMILARITY_EUCLIDEAN_DISTANCE,
                                                        SIMILARITY_LOGLIKELIHOOD,
                                                        SIMILARITY_PEARSON_CORRELATION,
                                                        SIMILARITY_TANIMOTO_COEFFICIENT,
                                                        SIMILARITY_UNCENTERED_COSINE,
                                                        SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE)
{code}
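The effect of the {{--filterFile}} option can be illustrated with a small in-memory Python sketch. It only mirrors the semantics described in the option listing (each _userID,itemID_ pair removes that item from that user's recommendations) and is not how the distributed job implements it. The helper names, the integer IDs and the sample recommendations are assumptions made for this example.

```python
def parse_filter(lines):
    """Parse 'userID,itemID' lines into a set of excluded pairs."""
    excluded = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id, item_id = line.split(",")
        excluded.add((int(user_id), int(item_id)))  # assumes integer IDs
    return excluded


def apply_filter(recommendations, excluded):
    """Drop every (item, score) whose (user, item) pair is excluded."""
    return {
        user: [(item, score) for item, score in recs if (user, item) not in excluded]
        for user, recs in recommendations.items()
    }


# Made-up recommendations: userID -> list of (itemID, score).
recommendations = {1: [(10, 4.8), (11, 4.1)], 2: [(10, 3.9)]}
excluded = parse_filter(["1,10", "", "2,12"])
filtered = apply_filter(recommendations, excluded)
print(filtered)
```

Item 10 is removed for user 1 only; user 2 still receives it because the pair (2, 10) is not in the filter.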
TODO: add more details
h3. Resources
* [Sarwar et al.: Item-Based Collaborative Filtering Recommendation Algorithms|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.9927&rep=rep1&type=pdf]
* [Slides: Distributed Itembased Collaborative Filtering with Apache Mahout|http://www.slideshare.net/sscdotopen/mahoutcf]