Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Recommender First-Timer FAQ 
(https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+First-Timer+FAQ)

Added by Sean Owen:
---------------------------------------------------------------------
Many people with an interest in recommenders arrive at Mahout because they are 
building their first recommender system. Some starting questions have been asked 
often enough to warrant a FAQ collecting advice and rules of thumb for newcomers.

For the interested, these topics are treated in detail in the book [Mahout in 
Action|http://manning.com/owen/].

Don't start with a distributed, Hadoop-based recommender; take on that 
complexity only if necessary. Start with non-distributed recommenders. They are 
simpler, have fewer requirements, and are more flexible.

As a crude rule of thumb, a system with up to 100M user-item associations 
(ratings, preferences) should "fit" onto one modern server machine with 4GB of 
heap available and run acceptably as a real-time recommender. The system is 
invariably memory-bound since keeping data in memory is essential to 
performance.
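
To make the rule of thumb concrete, the arithmetic can be sketched as below. 
The class and method names are made up for illustration; the only inputs are 
the 4GB and 100M figures above. The result is the average heap budget per 
association, which has to cover the user ID, item ID, preference value and any 
data-structure overhead.

```java
// Back-of-the-envelope check of the rule of thumb (hypothetical helper, not part of Mahout).
final class HeapBudget {

    // Average number of heap bytes available per user-item association.
    static double bytesPerAssociation(long heapBytes, long associations) {
        if (associations <= 0) {
            throw new IllegalArgumentException("associations must be positive");
        }
        return (double) heapBytes / associations;
    }

    public static void main(String[] args) {
        long heapBytes = 4L * 1024 * 1024 * 1024; // 4GB of heap, as in the rule of thumb
        long associations = 100_000_000L;         // 100M user-item associations
        System.out.println(bytesPerAssociation(heapBytes, associations) + " bytes per association");
    }
}
```

With these figures the budget comes out at roughly 43 bytes per association, 
which is why compact in-memory representations matter at this scale.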

Beyond this point it gets expensive to deploy a machine with enough RAM, so 
designing for a distributed system makes sense when nearing this scale. However, 
most applications don't "really" have 100M associations to process. Data can be 
sampled; noisy and old data can often be aggressively pruned without 
significant impact on the result.
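
One way to prune and sample before loading data might look like the sketch 
below. Everything in it is an assumption made for illustration: the Pref 
class, the timestamp field, the cutoff and the sampling rate are not part of 
Mahout, and a real system would pick its own pruning criteria.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical pre-processing step: drop old preferences, then randomly sample the rest.
final class PreferencePruner {

    // Minimal stand-in for one user-item association with a timestamp.
    static final class Pref {
        final long userId;
        final long itemId;
        final float value;
        final long timestampMillis;

        Pref(long userId, long itemId, float value, long timestampMillis) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
            this.timestampMillis = timestampMillis;
        }
    }

    // Keeps preferences at or after cutoffMillis, each with probability sampleRate.
    static List<Pref> prune(List<Pref> prefs, long cutoffMillis, double sampleRate, Random rng) {
        List<Pref> kept = new ArrayList<>();
        for (Pref p : prefs) {
            if (p.timestampMillis < cutoffMillis) {
                continue; // too old: pruned outright
            }
            if (rng.nextDouble() < sampleRate) {
                kept.add(p); // survives sampling
            }
        }
        return kept;
    }
}
```

A sampleRate of 1.0 keeps every recent preference, so the same method covers 
pruning alone as well as pruning plus sampling.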

The next question is whether or not your system has preference values, or 
ratings. Do users and items merely have an association or not, such as the 
existence or lack of a click? Or is behavior translated into some scalar value 
representing the user's degree of preference for the item?

If you have ratings, then a good place to start is a 
GenericItemBasedRecommender, plus a PearsonCorrelationSimilarity similarity 
metric. If you don't have ratings, then a good place to start is 
GenericBooleanPrefItemBasedRecommender and LogLikelihoodSimilarity.
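
For intuition about what a Pearson-style item similarity measures when ratings 
are available, here is a standalone sketch. It is not Mahout's 
PearsonCorrelationSimilarity; the class name and the map-based input (user ID 
to rating, one map per item) are assumptions chosen to keep the example small.

```java
import java.util.Map;

// Illustrative Pearson correlation between two items, over users who rated both.
final class PearsonSketch {

    // Returns a value in [-1, 1], or NaN if fewer than two users rated both
    // items or if either item's ratings have zero variance over those users.
    static double similarity(Map<Long, Double> ratingsA, Map<Long, Double> ratingsB) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (Map.Entry<Long, Double> e : ratingsA.entrySet()) {
            Double other = ratingsB.get(e.getKey());
            if (other == null) {
                continue; // this user did not rate the second item
            }
            double a = e.getValue();
            double b = other;
            n++;
            sumA += a;
            sumB += b;
            sumAA += a * a;
            sumBB += b * b;
            sumAB += a * b;
        }
        if (n < 2) {
            return Double.NaN;
        }
        double covariance = sumAB - sumA * sumB / n;
        double varianceA = sumAA - sumA * sumA / n;
        double varianceB = sumBB - sumB * sumB / n;
        double denominator = Math.sqrt(varianceA * varianceB);
        return denominator == 0.0 ? Double.NaN : covariance / denominator;
    }
}
```

Without ratings there is nothing to correlate, which is one reason a 
co-occurrence-based metric is the suggested starting point for boolean data.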

If you want to do content-based item-item similarity, you need to implement 
your own ItemSimilarity.
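
As a rough idea of what the core of a content-based similarity could look like, 
the sketch below scores two items by the overlap of their tags. It is a 
standalone class, not an implementation of Mahout's ItemSimilarity interface, 
so it leaves out the interface's other methods; the tag data and the choice of 
Jaccard overlap are assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical content-based similarity: Jaccard overlap of item tag sets.
final class TagItemSimilarity {

    private final Map<Long, Set<String>> tagsByItem;

    TagItemSimilarity(Map<Long, Set<String>> tagsByItem) {
        this.tagsByItem = tagsByItem;
    }

    // Returns |shared tags| / |all tags of either item|, in [0, 1].
    // Items with no tags at all are treated as having similarity 0.
    double itemSimilarity(long itemID1, long itemID2) {
        Set<String> a = tagsByItem.getOrDefault(itemID1, Collections.<String>emptySet());
        Set<String> b = tagsByItem.getOrDefault(itemID2, Collections.<String>emptySet());
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        int unionSize = a.size() + b.size() - shared.size();
        return (double) shared.size() / unionSize;
    }
}
```

The same scoring logic could be moved into a class that implements 
ItemSimilarity so that an item-based recommender can use it.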

If your data can simply be exported to a CSV file, use FileDataModel and push 
new files periodically.
If your data is in a database, use MySQLJDBCDataModel (or its "BooleanPref" 
counterpart if appropriate, or its PostgreSQL counterpart, etc.) and wrap it in 
a ReloadFromJDBCDataModel.
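
For the CSV route, an export step might look like the sketch below. It assumes 
the comma-delimited "userID,itemID,preference" line layout that FileDataModel 
reads; the class name, method names and file handling are hypothetical and 
would need adapting to wherever your data actually lives.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical exporter that produces a preferences CSV file.
final class PreferenceCsvExporter {

    // Formats one association as "userID,itemID,preference".
    static String toCsvLine(long userId, long itemId, float value) {
        return userId + "," + itemId + "," + value;
    }

    // Writes all lines to a temporary file next to the target, then moves it
    // into place, so the target is replaced in one step rather than rewritten.
    static void export(List<String> lines, Path target) throws IOException {
        Path directory = target.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(directory, "prefs", ".tmp");
        Files.write(temp, lines, StandardCharsets.UTF_8);
        Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Running an export like this on a schedule is one way to "push new files 
periodically".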

This should give a reasonable starter system which responds fast. The nature of 
the system is that new data comes in from the file or database only 
periodically -- perhaps on the order of minutes. If that's not OK, you'll have 
to look into some more specialized work -- SlopeOneRecommender deals with 
updates quickly, or it is possible to do some work to update the 
GenericDataModel in real time.
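
The periodic-refresh pattern described here can be sketched generically as 
below: requests are served from an in-memory snapshot while a background task 
swaps in a fresh one every few minutes. This is only an illustration of the 
idea, not how Mahout's data models are implemented; the class name and the 
loader are assumptions.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical holder that serves an in-memory snapshot and reloads it periodically.
final class PeriodicSnapshot<T> {

    private final Supplier<T> loader;
    private final AtomicReference<T> current;

    // Loads the first snapshot eagerly so get() never returns an unloaded value.
    PeriodicSnapshot(Supplier<T> loader) {
        this.loader = loader;
        this.current = new AtomicReference<>(loader.get());
    }

    // Returns the most recently loaded snapshot; never blocks on a reload.
    T get() {
        return current.get();
    }

    // Loads a fresh snapshot and swaps it in.
    void reload() {
        current.set(loader.get());
    }

    // Schedules reload() every periodMinutes; the caller should shut the executor down.
    ScheduledExecutorService startReloading(long periodMinutes) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(this::reload, periodMinutes, periodMinutes, TimeUnit.MINUTES);
        return executor;
    }
}
```

Readers always see a complete snapshot, at the cost of data being up to one 
reload period stale.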

