Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Recommender Documentation 
(https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+Documentation)


Edited by Sean Owen:
---------------------------------------------------------------------
h2. Overview

_This documentation concerns the non-distributed, non-Hadoop-based recommender 
engine / collaborative filtering code inside Mahout. It was formerly a separate 
project called "Taste" and has continued development inside Mahout alongside 
other Hadoop-based code. It may be viewed as a somewhat separate, older, more 
comprehensive and more mature aspect of this code, compared to current 
development efforts focusing on Hadoop-based distributed recommenders. This 
remains the best entry point into Mahout recommender engines of all kinds._

A Mahout-based collaborative filtering engine takes users' preferences for 
items ("tastes") and returns estimated preferences for other items. For 
example, a site that sells books or CDs could easily use Mahout to figure out, 
from past purchase data, which CDs a customer might be interested in listening 
to.

Mahout provides a rich set of components from which you can construct a 
customized recommender system from a selection of algorithms. Taste is designed 
to be enterprise-ready; it's designed for performance, scalability and 
flexibility.

Mahout recommenders are not just for Java; they can be run as an external server 
which exposes recommendation logic to your application via web services and 
HTTP.

Top-level packages define the Taste interfaces to these key abstractions:
* DataModel
* UserSimilarity
* ItemSimilarity
* UserNeighborhood
* Recommender

Subpackages of {{org.apache.mahout.cf.taste.impl}} hold implementations of 
these interfaces. These are the pieces from which you will build your own 
recommendation engine. That's it! For the academically inclined, Mahout 
supports *memory-based* and *item-based* recommender systems, *slope one* 
recommenders, and a couple of other experimental implementations. It does not 
currently support *model-based* recommenders.

h2. Architecture

!https://cwiki.apache.org/confluence/download/attachments/22872433/taste-architecture.png!

This diagram shows the relationship between various Mahout components in a 
user-based recommender. An item-based recommender system is similar except that 
there are no PreferenceInferrers or Neighborhood algorithms involved.

h3. Recommender
A {{Recommender}} is the core abstraction in Mahout. Given a {{DataModel}}, it 
can produce recommendations. Applications will most likely use the 
{{GenericUserBasedRecommender}} or {{GenericItemBasedRecommender}} 
implementation, possibly decorated by {{CachingRecommender}}.

h3. DataModel
A {{DataModel}} is the interface to information about user preferences. An 
implementation might draw this data from any source, but a database is the most 
likely source. Mahout provides {{MySQLJDBCDataModel}}, for example, to access 
preference data from a database via JDBC and MySQL. Another exists for 
PostgreSQL. Mahout also provides a {{FileDataModel}}.

There are no abstractions for a user or item in the object model (not anymore). 
Users and items are identified solely by an ID value in the framework. Further, 
this ID value must be numeric; it is a Java {{long}} type through the APIs. A 
{{Preference}} object or {{PreferenceArray}} object encapsulates the relation 
between user and preferred items (or items and users preferring them).

Finally, Mahout supports, in various ways, a so-called "boolean" data model in 
which users do not express preferences of varying strengths for items, but 
simply express an association or none at all. For example, while users might 
express a preference from 1 to 5 in the context of a movie recommender site, 
there may be no notion of a preference value between users and pages in the 
context of recommending pages on a web site: there is only a notion of an 
association, or none, between a user and pages that have been visited.
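
The difference between the two kinds of data can be sketched in plain Java. This is only an illustration and does not use Mahout's {{DataModel}} classes; the class name {{BooleanPrefsSketch}} and the nested-map layout are assumptions made for the example. It shows valued preferences keyed by numeric user and item IDs, and the "boolean" view that keeps only the associations.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of Mahout.
class BooleanPrefsSketch {

    // Drop the preference values and keep only "user is associated with item".
    // Input layout (an assumption): userID -> (itemID -> preference value).
    static Map<Long, Set<Long>> toBoolean(Map<Long, Map<Long, Float>> valued) {
        Map<Long, Set<Long>> result = new HashMap<>();
        for (Map.Entry<Long, Map<Long, Float>> e : valued.entrySet()) {
            result.put(e.getKey(), new HashSet<>(e.getValue().keySet()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Float>> valued = new HashMap<>();
        // User 1 rated item 101 with 4.5 and item 102 with 2.0
        valued.computeIfAbsent(1L, k -> new HashMap<>()).put(101L, 4.5f);
        valued.computeIfAbsent(1L, k -> new HashMap<>()).put(102L, 2.0f);
        // The boolean view only records that user 1 is associated with both items
        System.out.println(toBoolean(valued));
    }
}
```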

h3. UserSimilarity
A {{UserSimilarity}} defines a notion of similarity between two {{User}}s. 
This is a crucial part of a recommendation engine. These are attached to a 
{{Neighborhood}} implementation. {{ItemSimilarity}}s are analogous, but find 
similarity between {{Item}}s.

h3. UserNeighborhood
In a user-based recommender, recommendations are produced by finding a 
"neighborhood" of similar users near a given user. A {{UserNeighborhood}} 
defines a means of determining that neighborhood, for example the nearest 
10 users. Implementations typically need a {{UserSimilarity}} to operate.
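
As a rough sketch of the "nearest N users" idea, the following plain-Java example picks the N most similar users from a map of already computed similarity values. It is not Mahout's {{UserNeighborhood}} implementation; the class {{NearestNSketch}} and the input map are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Mahout.
class NearestNSketch {

    // Return the IDs of the n users with the highest similarity to a target user.
    // similarityToTarget maps otherUserID -> similarity (assumed to be precomputed).
    static List<Long> nearestN(Map<Long, Double> similarityToTarget, int n) {
        List<Map.Entry<Long, Double>> entries =
            new ArrayList<>(similarityToTarget.entrySet());
        // Sort by similarity, highest first
        entries.sort(Map.Entry.<Long, Double>comparingByValue().reversed());
        List<Long> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Long, Double> sims = new HashMap<>();
        sims.put(11L, 0.2);
        sims.put(12L, 0.9);
        sims.put(13L, 0.6);
        System.out.println(nearestN(sims, 2));
    }
}
```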

h2. Requirements
h3. Required

* [Java/ J2SE 6.0|http://www.java.com/getjava/index.jsp]

h3. Optional
* [Apache Maven|http://maven.apache.org]  2.2.1 or later, if you want to build 
from source or build examples. (Mac users note that even OS X 10.5 ships with 
Maven 2.0.6, which will not work.)
* Taste web applications require a [Servlet 
2.3+|http://java.sun.com/products/servlet/index.jsp] container, such as [Apache 
Tomcat|http://jakarta.apache.org/tomcat/]. It may in fact work with 
older containers with slight modification.

h2. Demo

To build and run the demo, follow the instructions below, which are written for 
Unix-like operating systems:

* Obtain a copy of the Mahout distribution, either from SVN or as a downloaded 
archive.
* Download the "1 Million MovieLens Dataset" from 
[Grouplens.org|http://www.grouplens.org/]
* Unpack the archive and copy {{movies.dat}} and {{ratings.dat}} to 
{{trunk/taste-web/src/main/resources/org/apache/mahout/cf/taste/example/grouplens}}
 under the Mahout distribution directory.
* Navigate to the directory where you unpacked the Mahout distribution, and 
navigate to {{trunk}}.
* Run {{mvn install}}, which builds and installs Mahout core to your local 
repository
* {{cd taste-web}}
* {{cp ../examples/target/grouplens.jar ./lib}}
* Edit {{recommender.properties}} and fill in the {{recommender.class}}: 
{{recommender.class=org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommender}}
* {{mvn package}}
* {{mvn jetty:run-war}}. You may need to give Maven more memory: in a bash 
shell, {{export MAVEN_OPTS=-Xmx1024M}}
* Get recommendations by accessing the web application in your browser: 
{{http://localhost:8080/RecommenderServlet?userID=1}} This will produce a 
simple preference-item ID list which could be consumed by a client application. 
Get more useful human-readable output with the {{debug}} parameter: 
{{http://localhost:8080/RecommenderServlet?userID=1&debug=true}} Incidentally, 
Taste's web service interface may then be found at: 
{{http://localhost:8080/RecommenderService.jws}}

Its WSDL file will be here...

{{http://localhost:8080/RecommenderService.jws?wsdl}}

... and you can even access it in your browser via a simple HTTP request:

{{.../RecommenderService.jws?method=recommend&userID=1&howMany=10}}

*Note:* the exact URL where the service is deployed depends on how you deployed 
the application in your app server. For instance if you deployed it as a .war 
file called 'mahout-taste-webapp.war', it will deploy at a URI whose path 
begins with /mahout-taste-webapp/ instead.

h2. Examples
h3. User-based Recommender
User-based recommenders are the "original", conventional style of recommender 
system. They can produce good recommendations when tweaked properly; they are 
not necessarily the fastest recommender systems and are thus suitable for small 
data sets (roughly, less than ten million ratings). We'll start with an example 
of this.

First, create a {{DataModel}} of some kind. Here, we'll use a simple one based 
on data in a file. The file should be in CSV format, with lines of the form 
{{userID,itemID,prefValue}} (e.g. "39505,290002,3.5"):

{{DataModel model = new FileDataModel(new File("data.txt"));}}
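
To make the expected file format concrete, here is a small plain-Java sketch that parses one {{userID,itemID,prefValue}} line. It is not how {{FileDataModel}} is implemented; the class {{PrefLineParserSketch}} and the nested map it fills are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Mahout.
class PrefLineParserSketch {

    // Parse one "userID,itemID,prefValue" line into the nested map
    // userID -> (itemID -> preference value).
    static void addLine(String line, Map<Long, Map<Long, Float>> prefs) {
        String[] parts = line.trim().split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                "Expected userID,itemID,prefValue but got: " + line);
        }
        long userID = Long.parseLong(parts[0].trim());   // IDs must be numeric
        long itemID = Long.parseLong(parts[1].trim());
        float value = Float.parseFloat(parts[2].trim());
        prefs.computeIfAbsent(userID, k -> new HashMap<>()).put(itemID, value);
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Float>> prefs = new HashMap<>();
        addLine("39505,290002,3.5", prefs);
        System.out.println(prefs);
    }
}
```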

We'll use the PearsonCorrelationSimilarity implementation of {{UserSimilarity}} 
as our user correlation algorithm, and add an optional preference inference 
algorithm:

{code}
UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(model);
// Optional:
userSimilarity.setPreferenceInferrer(new AveragingPreferenceInferrer());
{code}
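
For readers who want to see what a Pearson correlation between two users amounts to, here is a minimal plain-Java sketch computed over the items both users have rated. It is a textbook formulation, not the code inside {{PearsonCorrelationSimilarity}}; the class {{PearsonSketch}} and the choice to return NaN for undefined cases are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Mahout.
class PearsonSketch {

    // Pearson correlation over the items rated by both users.
    // Each map is itemID -> preference value for one user.
    // Returns NaN when fewer than two items overlap or when the
    // overlapping ratings of either user have zero variance.
    static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other == null) {
                continue; // item not rated by both users
            }
            double x = e.getValue();
            double y = other;
            n++;
            sumA += x;
            sumB += y;
            sumAA += x * x;
            sumBB += y * y;
            sumAB += x * y;
        }
        if (n < 2) {
            return Double.NaN;
        }
        double covariance = sumAB - sumA * sumB / n;
        double varianceA = sumAA - sumA * sumA / n;
        double varianceB = sumBB - sumB * sumB / n;
        double denominator = Math.sqrt(varianceA * varianceB);
        return denominator == 0.0 ? Double.NaN : covariance / denominator;
    }

    public static void main(String[] args) {
        Map<Long, Double> u1 = new HashMap<>();
        Map<Long, Double> u2 = new HashMap<>();
        u1.put(1L, 1.0); u1.put(2L, 2.0); u1.put(3L, 3.0);
        u2.put(1L, 2.0); u2.put(2L, 4.0); u2.put(3L, 6.0);
        System.out.println(pearson(u1, u2));
    }
}
```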

Now we create a {{UserNeighborhood}} algorithm. Here we use nearest-3:

{code}
UserNeighborhood neighborhood =
          new NearestNUserNeighborhood(3, userSimilarity, model);
{code}

Now we can create our {{Recommender}}, and add a caching decorator:

{code}
Recommender recommender =
          new GenericUserBasedRecommender(model, neighborhood, userSimilarity);
Recommender cachingRecommender = new CachingRecommender(recommender);
{code}

Now we can get 10 recommendations for user ID "1234". Done!
{code}
List<RecommendedItem> recommendations =
          cachingRecommender.recommend(1234, 10);
{code}
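
Conceptually, a user-based recommender estimates a user's preference for an unseen item from the preferences of the neighborhood. The plain-Java sketch below uses a similarity-weighted average for that estimate. It is a simplification for illustration, not the logic of {{GenericUserBasedRecommender}}; the class {{UserBasedEstimateSketch}}, the map layouts, and the decision to skip non-positive similarities are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Mahout.
class UserBasedEstimateSketch {

    // Estimate a preference for itemID as the similarity-weighted average
    // of the neighbors' preferences for that item.
    // neighborSimilarities: neighborID -> similarity to the target user
    // prefs: userID -> (itemID -> preference value)
    static double estimate(long itemID,
                           Map<Long, Double> neighborSimilarities,
                           Map<Long, Map<Long, Double>> prefs) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (Map.Entry<Long, Double> neighbor : neighborSimilarities.entrySet()) {
            double similarity = neighbor.getValue();
            Map<Long, Double> neighborPrefs = prefs.get(neighbor.getKey());
            Double pref = neighborPrefs == null ? null : neighborPrefs.get(itemID);
            // Simplifying assumption: ignore neighbors without a preference
            // for the item and neighbors with non-positive similarity.
            if (pref == null || Double.isNaN(similarity) || similarity <= 0.0) {
                continue;
            }
            weightedSum += similarity * pref;
            totalWeight += similarity;
        }
        return totalWeight == 0.0 ? Double.NaN : weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        Map<Long, Double> sims = new HashMap<>();
        sims.put(2L, 0.5);
        sims.put(3L, 1.0);
        Map<Long, Map<Long, Double>> prefs = new HashMap<>();
        prefs.computeIfAbsent(2L, k -> new HashMap<>()).put(500L, 4.0);
        prefs.computeIfAbsent(3L, k -> new HashMap<>()).put(500L, 1.0);
        System.out.println(estimate(500L, sims, prefs));
    }
}
```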

h3. Item-based Recommender

We could have created an item-based recommender instead. Item-based recommenders 
base recommendations not on user similarity, but on item similarity. In theory 
these are about the same approach to the problem, just from different angles. 
However, the similarity of two items is relatively fixed, more so than the 
similarity of two users. So, item-based recommenders can use pre-computed 
similarity values in the computations, which makes them much faster. For large 
data sets, item-based recommenders are more appropriate.

Let's start over, again with a {{FileDataModel}} to start:

{code}
DataModel model = new FileDataModel(new File("data.txt"));
{code}

We'll also need an {{ItemSimilarity}}. We could use 
{{PearsonCorrelationSimilarity}}, which computes item similarity in real time, 
but this is generally too slow to be useful. Instead, in a real application, 
you would feed a list of pre-computed correlations to a 
{{GenericItemSimilarity}}: 

{code}
// Construct the list of pre-computed correlations
Collection<GenericItemSimilarity.ItemItemSimilarity> correlations =
          ...;
ItemSimilarity itemSimilarity =
          new GenericItemSimilarity(correlations);

{code}

Then we can finish as before to produce recommendations:

{code}
Recommender recommender =
          new GenericItemBasedRecommender(model, itemSimilarity);
Recommender cachingRecommender = new CachingRecommender(recommender);
...
List<RecommendedItem> recommendations =
          cachingRecommender.recommend(1234, 10);
{code}
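
The reason pre-computed similarities help is that the estimate for a candidate item only needs lookups, not fresh similarity computations. The plain-Java sketch below shows one common way to form such an estimate, as a similarity-weighted average over the items the user has already rated. It is not the code of {{GenericItemBasedRecommender}}; the class {{ItemBasedEstimateSketch}}, the nested-map storage of item pairs, and the handling of missing or non-positive similarities are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Mahout.
class ItemBasedEstimateSketch {

    // Look up a pre-computed, symmetric item-item similarity.
    // Pairs are stored once as smallerItemID -> (largerItemID -> similarity).
    static double similarity(Map<Long, Map<Long, Double>> sims, long a, long b) {
        long lo = Math.min(a, b);
        long hi = Math.max(a, b);
        Map<Long, Double> row = sims.get(lo);
        Double s = row == null ? null : row.get(hi);
        return s == null ? 0.0 : s; // assumption: unknown pairs count as 0
    }

    // Estimate the user's preference for a candidate item from the items
    // the user has already rated (userPrefs: itemID -> preference value).
    static double estimate(long candidate,
                           Map<Long, Double> userPrefs,
                           Map<Long, Map<Long, Double>> sims) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (Map.Entry<Long, Double> rated : userPrefs.entrySet()) {
            double s = similarity(sims, candidate, rated.getKey());
            if (s <= 0.0) {
                continue; // simplifying assumption: skip non-positive similarities
            }
            weightedSum += s * rated.getValue();
            totalWeight += s;
        }
        return totalWeight == 0.0 ? Double.NaN : weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Double>> sims = new HashMap<>();
        sims.computeIfAbsent(1L, k -> new HashMap<>()).put(3L, 0.5);
        sims.computeIfAbsent(2L, k -> new HashMap<>()).put(3L, 1.0);
        Map<Long, Double> userPrefs = new HashMap<>();
        userPrefs.put(1L, 4.0);
        userPrefs.put(2L, 1.0);
        System.out.println(estimate(3L, userPrefs, sims));
    }
}
```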

h3. Slope-One Recommender
This is a simple yet effective {{Recommender}} and we present another example 
to round out the list:

{code}
DataModel model = new FileDataModel(new File("data.txt"));
// Make a weighted slope one recommender
Recommender recommender = new SlopeOneRecommender(model);
Recommender cachingRecommender = new CachingRecommender(recommender);
{code}
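
For intuition, here is a compact plain-Java sketch of the weighted slope one prediction as described in the Lemire and Maclachlan paper listed under Useful Links. It recomputes the rating differences on every call, which a real implementation would avoid, and it is not the code of {{SlopeOneRecommender}}; the class {{SlopeOneSketch}} and the map layout are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Mahout.
class SlopeOneSketch {

    // Weighted slope one prediction of userID's preference for the target item.
    // prefs: userID -> (itemID -> preference value)
    static double predict(long userID, long target,
                          Map<Long, Map<Long, Double>> prefs) {
        Map<Long, Double> userPrefs = prefs.get(userID);
        if (userPrefs == null) {
            return Double.NaN;
        }
        double numerator = 0.0;
        long denominator = 0;
        for (Map.Entry<Long, Double> rated : userPrefs.entrySet()) {
            long item = rated.getKey();
            if (item == target) {
                continue;
            }
            // Average difference (target - item) over users who rated both
            double diffSum = 0.0;
            int count = 0;
            for (Map<Long, Double> other : prefs.values()) {
                Double rTarget = other.get(target);
                Double rItem = other.get(item);
                if (rTarget != null && rItem != null) {
                    diffSum += rTarget - rItem;
                    count++;
                }
            }
            if (count == 0) {
                continue;
            }
            // Weight each (average difference + user's rating) by its support
            numerator += (diffSum / count + rated.getValue()) * count;
            denominator += count;
        }
        return denominator == 0 ? Double.NaN : numerator / denominator;
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Double>> prefs = new HashMap<>();
        // Items: 1 = A, 2 = B, 3 = C
        prefs.computeIfAbsent(10L, k -> new HashMap<>()).put(1L, 5.0);
        prefs.get(10L).put(2L, 3.0);
        prefs.get(10L).put(3L, 2.0);
        prefs.computeIfAbsent(20L, k -> new HashMap<>()).put(1L, 3.0);
        prefs.get(20L).put(2L, 4.0);
        prefs.computeIfAbsent(30L, k -> new HashMap<>()).put(2L, 2.0);
        prefs.get(30L).put(3L, 5.0);
        System.out.println(predict(30L, 1L, prefs));
    }
}
```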


    
h2. Integration with your application
h3. Direct

You can create a {{Recommender}}, as shown above, wherever you like in your 
Java application, and use it. This includes simple Java applications or GUI 
applications, server applications, and J2EE web applications.

h3. Standalone server
A Mahout recommender can also be run as an external server, which may be the 
only option for non-Java applications. It can be exposed as a web application 
via {{org.apache.mahout.cf.taste.web.RecommenderServlet}}, and your application 
can then access recommendations via simple HTTP requests and responses, or as a 
full-fledged SOAP web service. See above, and see the javadoc for details.

To deploy your {{Recommender}} as an external server:

* Obtain a copy of the Mahout distribution, either from SVN or as a downloaded 
archive.
* Create an implementation of 
{{org.apache.mahout.cf.taste.recommender.Recommender}} (must have a no-arg 
constructor).
* Compile it and create a JAR file containing your implementation.
* Navigate to the directory where you unpacked the Mahout distribution, and 
navigate to {{trunk}}.
* Run {{mvn install}}, which builds and installs Mahout core to your local 
repository.
* {{cd taste-web}}
* Copy your .jar file: {{cp [your .jar file] ./lib}}
* Edit {{recommender.properties}} and fill in the {{recommender.class}} with 
your Recommender class: {{recommender.class=[your recommender class]}}
* {{mvn package}}
* Your .war file is now available in the build directory as 
{{mahout-taste-webapp.war}} (which can be renamed).

h2. Performance
h3. Runtime Performance
The more data you give, the better. Though Mahout is designed for performance, 
you will undoubtedly run into performance issues at some point. For best 
results, consider using the following command-line flags to your JVM:

* {{-server}}: Enables the server VM, which is generally appropriate for 
long-running, computation-intensive applications.
* {{-Xms1024m -Xmx1024m}}: Make the heap as big as possible -- a gigabyte 
doesn't hurt when dealing with tens of millions of preferences. Taste will 
generally use as much memory as you give it for caching, which helps 
performance. Set the initial and max size to the same value to avoid wasting 
time growing the heap, and to avoid having the JVM run minor collections to 
avoid growing the heap, which will clear cached values.
* {{-da -dsa}}: Disable all assertions.
* {{-XX:NewRatio=9}}: Increase heap allocated to 'old' objects, which is most 
of them in this framework
* {{-XX:+UseParallelGC -XX:+UseParallelOldGC}} (multi-processor machines only): 
Use a GC algorithm designed to take advantage of multiple processors, and 
designed for throughput. This is a default in J2SE 5.0.
* {{-XX:+DisableExplicitGC}}: Disable calls to {{System.gc()}}. These calls can 
only hurt in the presence of modern GC algorithms; they may force Taste to 
remove cached data needlessly. This flag isn't needed if you're sure your code 
and third-party code you use doesn't call this method.
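
One way to confirm that flags such as {{-Xmx}} actually reached the JVM is to print the effective settings at startup. The sketch below uses only standard JDK APIs ({{Runtime}} and {{java.lang.management}}); the class name {{JvmSettingsSketch}} is hypothetical, and the reported maximum heap can be somewhat smaller than the value passed on the command line.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical illustration class; not part of Mahout.
class JvmSettingsSketch {

    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    // Arguments passed to the JVM itself (for example -server, -Xmx1024m).
    static List<String> jvmArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MB): " + maxHeapMegabytes());
        System.out.println("JVM arguments: " + jvmArguments());
    }
}
```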

Also consider the following tips:

* Use {{CachingRecommender}} on top of your custom {{Recommender}} 
implementation.
* When using {{JDBCDataModel}}, make sure you've taken basic steps to optimize 
the table storing preference data. Create a primary key on the user ID and item 
ID columns, and an index on them. Set them to be non-null. And so on. Tune your 
database for lots of concurrent reads! When using JDBC, the database is almost 
always the bottleneck. Plenty of memory and caching are even more important.
* Also, pooling database connections is essential to performance. If using a 
J2EE container, it probably provides a way to configure connection pools. If 
you are creating your own {{DataSource}} directly, try wrapping it in 
{{org.apache.mahout.cf.taste.impl.model.jdbc.ConnectionPoolDataSource}}
* See MySQL-specific notes on performance in the javadoc for 
{{MySQLJDBCDataModel}}.
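
To illustrate why a caching decorator helps, here is a minimal plain-Java sketch of the decorator idea. It does not reproduce {{CachingRecommender}}; the interface {{SimpleRecommender}}, the class {{Caching}}, and the cache key format are hypothetical and chosen only for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration class; not part of Mahout.
class CachingSketch {

    // Stand-in for a recommender: returns recommended item IDs for a user.
    interface SimpleRecommender {
        List<Long> recommend(long userID, int howMany);
    }

    // Decorator that remembers results per (userID, howMany) pair.
    static class Caching implements SimpleRecommender {
        private final SimpleRecommender delegate;
        private final Map<String, List<Long>> cache = new ConcurrentHashMap<>();

        Caching(SimpleRecommender delegate) {
            this.delegate = delegate;
        }

        @Override
        public List<Long> recommend(long userID, int howMany) {
            String key = userID + ":" + howMany;
            // Only call the underlying recommender on a cache miss
            return cache.computeIfAbsent(key, k -> delegate.recommend(userID, howMany));
        }

        // Call this when the underlying data changes
        void clear() {
            cache.clear();
        }
    }

    public static void main(String[] args) {
        SimpleRecommender slow = (userID, howMany) -> {
            System.out.println("computing for user " + userID);
            return Arrays.asList(101L, 102L);
        };
        SimpleRecommender cached = new Caching(slow);
        cached.recommend(1234L, 10); // computed by the delegate
        cached.recommend(1234L, 10); // served from the cache
    }
}
```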

h3. Algorithm Performance: Which One Is Best?
There is no right answer; it depends on your data, your application, 
environment, and performance needs. Mahout provides the building blocks from 
which you can construct the best {{Recommender}} for your application. The 
links below provide research on this topic. You will probably need a bit of 
trial-and-error to find a setup that works best. The code sample above provides 
a good starting point.

Fortunately, Mahout provides a way to evaluate the accuracy of your 
{{Recommender}} on your own data, in {{org.apache.mahout.cf.taste.eval}}:

{code}
DataModel myModel = ...;
RecommenderBuilder builder = new RecommenderBuilder() {
  public Recommender buildRecommender(DataModel model) {
    // build and return the Recommender to evaluate here
  }
};

RecommenderEvaluator evaluator =
          new AverageAbsoluteDifferenceRecommenderEvaluator();
double evaluation = evaluator.evaluate(builder, myModel, 0.9, 1.0);
{code}
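
The score from this kind of evaluation is an average of absolute differences between estimated and actual preference values, so lower is better. The plain-Java sketch below shows that arithmetic on its own. It is not the evaluator's implementation; the class {{AverageAbsoluteDifferenceSketch}}, the array inputs, and the choice to skip NaN estimates are assumptions for illustration.

```java
// Hypothetical illustration class; not part of Mahout.
class AverageAbsoluteDifferenceSketch {

    // Mean of |actual - estimated| over the held-out preferences.
    // Entries whose estimate is NaN (no estimate possible) are skipped.
    static double averageAbsoluteDifference(double[] actual, double[] estimated) {
        if (actual.length != estimated.length) {
            throw new IllegalArgumentException("Arrays must have the same length");
        }
        double total = 0.0;
        int count = 0;
        for (int i = 0; i < actual.length; i++) {
            if (Double.isNaN(estimated[i])) {
                continue;
            }
            total += Math.abs(actual[i] - estimated[i]);
            count++;
        }
        return count == 0 ? Double.NaN : total / count;
    }

    public static void main(String[] args) {
        double[] actual = {4.0, 2.0, 5.0};
        double[] estimated = {3.0, 2.0, 3.0};
        System.out.println(averageAbsoluteDifference(actual, estimated));
    }
}
```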

For "boolean" data model situations, where there are no notions of preference 
value, the above evaluation based on estimated preference does not make sense. 
In this case, try this kind of evaluation, which presents traditional 
information retrieval figures like precision and recall, which are more 
meaningful:

{code}
...
RecommenderIRStatsEvaluator evaluator =
        new GenericRecommenderIRStatsEvaluator();
IRStatistics stats =
        evaluator.evaluate(builder, myModel, null, 3,
                           GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD,
                           1.0);
{code}
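
As a reminder of what precision and recall mean for a list of recommendations, here is a small plain-Java sketch. It only shows the two ratios and is not the evaluator's implementation; the class {{PrecisionRecallSketch}} and the idea of passing the "relevant" items in as a set are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of Mahout.
class PrecisionRecallSketch {

    // Number of recommended items that are also relevant.
    static int hits(List<Long> recommended, Set<Long> relevant) {
        int hits = 0;
        for (Long itemID : recommended) {
            if (relevant.contains(itemID)) {
                hits++;
            }
        }
        return hits;
    }

    // Fraction of the recommended items that are relevant.
    static double precision(List<Long> recommended, Set<Long> relevant) {
        return recommended.isEmpty()
            ? Double.NaN
            : (double) hits(recommended, relevant) / recommended.size();
    }

    // Fraction of the relevant items that were recommended.
    static double recall(List<Long> recommended, Set<Long> relevant) {
        return relevant.isEmpty()
            ? Double.NaN
            : (double) hits(recommended, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        List<Long> top3 = Arrays.asList(1L, 2L, 3L);
        Set<Long> relevant = new HashSet<>(Arrays.asList(2L, 3L, 4L, 5L));
        System.out.println("precision: " + precision(top3, relevant));
        System.out.println("recall: " + recall(top3, relevant));
    }
}
```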


h2. Useful Links
You'll want to look at these packages too, which offer more algorithms and 
approaches that you may find useful:

* [Cofi|http://www.nongnu.org/cofi/]: A Java-Based Collaborative Filtering 
Library
* [CoFE|http://eecs.oregonstate.edu/iis/CoFE/]

Here's a handful of research papers that I've read and found particularly 
useful:

J.S. Breese, D. Heckerman and C. Kadie, "[Empirical Analysis of Predictive 
Algorithms for Collaborative 
Filtering|http://research.microsoft.com/research/pubs/view.aspx?tr_id=166]," in 
Proceedings of the Fourteenth Conference on Uncertainty in Artificial 
Intelligence (UAI 1998), 1998.

B. Sarwar, G. Karypis, J. Konstan and J. Riedl, "[Item-based collaborative 
filtering recommendation algorithms|http://www10.org/cdrom/papers/519/]," in 
Proceedings of the Tenth International Conference on the World Wide Web (WWW 
10), pp. 285-295, 2001.

P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom and J. Riedl, "[GroupLens: an 
open architecture for collaborative filtering of 
netnews|http://doi.acm.org/10.1145/192844.192905]," in Proceedings of the 1994 
ACM conference on Computer Supported Cooperative Work (CSCW 1994), pp. 175-186, 
1994.

J.L. Herlocker, J.A. Konstan, A. Borchers and J. Riedl, "[An algorithmic 
framework for performing collaborative 
filtering|http://www.grouplens.org/papers/pdf/algs.pdf]," in Proceedings of the 
22nd annual international ACM SIGIR Conference on Research and Development in 
Information Retrieval (SIGIR 99), pp. 230-237, 1999.

Clifford Lyon, "[Movie 
Recommender|http://materialobjects.com/cf/MovieRecommender.pdf]," CSCI E-280 
final project, Harvard University, 2004.

Daniel Lemire, Anna Maclachlan, "[Slope One Predictors for Online Rating-Based 
Collaborative 
Filtering|http://www.daniel-lemire.com/fr/abstracts/SDM2005.html]," Proceedings 
of SIAM Data Mining (SDM '05), 2005.

Michelle Anderson, Marcel Ball, Harold Boley, Stephen Greene, Nancy Howse, 
Daniel Lemire and Sean McGrath, "[RACOFI: A Rule-Applying Collaborative 
Filtering 
System|http://www.daniel-lemire.com/fr/documents/publications/racofi_nrc.pdf]," 
Proceedings of COLA '03, 2003.

These links will take you to all the collaborative filtering reading you could 
ever want!
* [Paul Perry's notes|http://www.paulperry.net/notes/cf.asp]
* [James Thornton's collaborative filtering 
resources|http://jamesthornton.com/cf/]
* [Daniel Lemire's blog|http://www.daniel-lemire.com/blog/] which frequently 
covers collaborative filtering topics

