I'm working on a demo of Mahout, and part of it covers collaborative
filtering (CF). For the CF part, I'm following an idea from Ted about a
way to demonstrate how CF works conceptually. (Ted, please correct me if
my understanding is incorrect.)
I took a subset of Wikipedia articles (2302 of them, created by the
WikipediaXMLSplitter in the example directory and available at
http://people.apache.org/~gsingers/wikipedia/chunks.tar.gz).
Next, I picked a topic of interest, in this case all docs containing
the phrase "Abraham Lincoln", and I made the assumption that there are
10 users out of a total of 1000 who are "Lincolnphiles" and have
thereby rated most of the articles (17 total) on the topic. The
ratings range between -5 and 5 (as doubles). For the most part, the
Lincolnphiles like the same things, though to varying degrees. (Note: I
did these ratings by hand and thus "stacked the deck.") The Lincolnphiles
are really obsessed and did not rate any other documents. However, not
all of them rated all 17 articles.
Next, I assumed the other 990 users rate randomly across all the
documents, in the same range. So, for every article in the set, I
randomly picked X users and had each of them assign a random degree
of like or dislike in the range mentioned.
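For concreteness, the random-rater setup described above could be sketched roughly as follows. This is not the actual demo code: the class name RandomRaters, the user id layout (0 to 9 for the Lincolnphiles, 10 to 999 for everyone else), the seed, and the ratersPerDoc parameter (standing in for X) are all illustrative assumptions, and ratings are drawn uniformly from [-5, 5).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class RandomRaters {
    static final int NUM_DOCS = 2302;         // articles in the Wikipedia subset
    static final int NUM_USERS = 1000;        // total users in the demo
    static final int FIRST_RANDOM_USER = 10;  // assumption: ids 0..9 are the Lincolnphiles

    /** Returns docId -> (userId -> rating) for the randomly rating users only. */
    static Map<Integer, Map<Integer, Double>> generate(int ratersPerDoc, long seed) {
        int randomUsers = NUM_USERS - FIRST_RANDOM_USER;
        if (ratersPerDoc < 0 || ratersPerDoc > randomUsers) {
            throw new IllegalArgumentException("ratersPerDoc must be in 0.." + randomUsers);
        }
        Random random = new Random(seed);
        Map<Integer, Map<Integer, Double>> ratings = new HashMap<>();
        for (int doc = 0; doc < NUM_DOCS; doc++) {
            Map<Integer, Double> docRatings = new HashMap<>();
            // Keep drawing until this doc has the requested number of distinct raters.
            while (docRatings.size() < ratersPerDoc) {
                int user = FIRST_RANDOM_USER + random.nextInt(randomUsers);
                double rating = -5.0 + 10.0 * random.nextDouble(); // uniform in [-5, 5)
                docRatings.putIfAbsent(user, rating);
            }
            ratings.put(doc, docRatings);
        }
        return ratings;
    }

    public static void main(String[] args) {
        Map<Integer, Map<Integer, Double>> ratings = generate(5, 42L);
        System.out.println("Docs with random ratings: " + ratings.size());
    }
}
```

The hand-made Lincolnphile ratings would be added on top of this map for the 17 Lincoln docs.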
I then implemented a basic recommender following the Taste docs, under
the "User-based recommenders" section, and passed in the user id of one
of the Lincolnphiles. The results I get back are a bit surprising: none
of the recommendations are for other items rated highly by the
Lincolnphiles, even though, with the neighborhood size set to 10, the
neighborhood contains all of the other Lincolnphiles plus one
non-Lincolnphile. I would expect the recommendations to be for items
that my Lincolnphile has not rated but that at least some of the other
Lincolnphiles have, but in fact none of the recommendations are for
Lincoln docs.
OK, so I then played around a bit with the neighborhood size. If I make
it 9 (the number of other Lincolnphiles in the system) or less, I get
what I expected. It also seems the one non-Lincolnphile rated a lot more
items than the Lincolnphiles did. Is that why that user's items seem to
dominate the recommendations? Looking at the non-Lincolnphile, I see
that this user and my Lincolnphile have two rated items in common: one
that they both really liked and one that they disagreed on.
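One thing that may be worth checking here: with just two co-rated items, a plain Pearson correlation can only come out as +1 or -1 (or be undefined), no matter how far apart the ratings are. The sketch below is textbook Pearson, not Taste's implementation, and it ignores the preference inferrer; the class name and the rating values are made up for illustration.

```java
public class TwoPointPearson {
    /** Textbook Pearson correlation; x[i] and y[i] are two users' ratings of co-rated item i. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        // NaN when either user gave identical ratings to all co-rated items.
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        // Hypothetical ratings: both liked item 0, they disagreed on item 1.
        double[] lincolnphile = {4.5, 3.0};
        double[] activeUser = {5.0, -2.0};
        System.out.println(pearson(lincolnphile, activeUser));
    }
}
```

In this made-up case both users rated the second item lower than the first, so the two points line up perfectly and the correlation is +1 even though they disagree on the second item. If Taste behaves like this on small overlaps, that could explain how the non-Lincolnphile makes it into the neighborhood.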
I'm not exactly sure what my questions are, other than the one above:
can an active user dominate like-minded but less active raters, and
what, if anything, is the appropriate thing to do about it? Mostly I
wanted to make sure this all makes sense.
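To make the question concrete, here is a sketch of one way an active neighbor could dominate. It assumes candidate items are scored by a similarity-weighted average over only those neighbors who rated the item; I have not checked that against what GenericUserBasedRecommender actually does. The class name, similarities, and ratings are invented for illustration.

```java
public class NeighborEstimate {
    /**
     * Similarity-weighted average of the neighbors' ratings for one item.
     * A rating of NaN means that neighbor did not rate the item and is skipped.
     */
    static double estimate(double[] similarities, double[] ratings) {
        double weightedSum = 0.0;
        double totalSimilarity = 0.0;
        for (int i = 0; i < ratings.length; i++) {
            if (Double.isNaN(ratings[i])) {
                continue;
            }
            weightedSum += similarities[i] * ratings[i];
            totalSimilarity += similarities[i];
        }
        return totalSimilarity == 0.0 ? Double.NaN : weightedSum / totalSimilarity;
    }

    public static void main(String[] args) {
        double notRated = Double.NaN;
        // Hypothetical neighborhood: three Lincolnphiles, then the one active user.
        double[] similarities = {0.9, 0.8, 0.7, 1.0};
        double[] lincolnDoc = {4.0, 3.0, 5.0, notRated};          // rated by the Lincolnphiles
        double[] otherDoc = {notRated, notRated, notRated, 5.0};  // rated only by the active user
        System.out.println("Lincoln doc estimate: " + estimate(similarities, lincolnDoc));
        System.out.println("Other doc estimate:   " + estimate(similarities, otherDoc));
    }
}
```

Under that assumption, any item the active user alone rated 5 gets an estimate of exactly 5, while a Lincoln doc gets the blended opinion of several raters, which is lower unless they all gave it top marks. A user with many ratings supplies many such single-rater candidates.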
Also, is there any notion in Taste similar to Lucene's explain method
(http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/search/Searcher.html#explain(org.apache.lucene.search.Query,%20int))?
After this sanity check, my next goal is to show how a new
Lincolnphile coming into the system would be guided to other content
on Lincoln.
[And yes, once done, this code will be publicly available, but it will
be a little while]
Here's my snippet of code for recommending, pretty much verbatim from
the Taste docs:

    // Pearson correlation between users' ratings drives the neighborhood.
    UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);
    // Optional: infer preferences for items a user has not rated.
    userSimilarity.setPreferenceInferrer(new AveragingPreferenceInferrer(dataModel));

    // neighSize is the neighborhood size discussed above (9 vs. 10).
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(neighSize, userSimilarity, dataModel);
    Collection<User> users = neighborhood.getUserNeighborhood(userId);
    for (User neighbor : users) {
      System.out.println("Neighbor: " + neighbor);
    }

    Recommender recommender =
        new GenericUserBasedRecommender(dataModel, neighborhood, userSimilarity);
    Recommender cachingRecommender = new CachingRecommender(recommender);

    // Ask for the top 10 recommendations for the given user.
    List<RecommendedItem> recommendations = cachingRecommender.recommend(userId, 10);
    System.out.println("Recommendations:");
    for (RecommendedItem item : recommendations) {
      Item theItem = item.getItem();
      String title = idsToTitle.get(theItem.getID().toString());
      System.out.println("Doc Id: " + theItem + " Title: " + title);
    }

Cheers,
Grant