My first reaction is that this could make sense, if my guesses below are accurate.
The 'philes show up as very similar, good. Since they have all rated mostly the same things, they don't contribute much to the possible recommendations for each other. If they have all rated all but one Lincoln article, then they will generate at best one new recommendation for each other!

When the normal user sneaks in, suddenly there are more possible items to recommend. Naturally, they round out the list. You ask for 10 recs, and you get the 1 good rec plus the next 9 best recs, all basically from that user.

That is, I imagine the estimated pref value of the other 9 is notably lower. In practice, therefore, you might choose to chop off recs whose estimated pref is below a certain value, or maybe truncate the list when the estimate from one to the next drops significantly.

Am I right? If so, one response is that this funky behavior is mostly a function of the particular data you have constructed.

On Jun 18, 2009 4:30 PM, "Grant Ingersoll" <[email protected]> wrote:

I'm working on a demo on Mahout and part of it is on collab. filtering. For the CF part, I'm taking the lead from an idea from Ted about a way to demonstrate how CF works conceptually. (Ted, please correct me if my understanding is incorrect.)

I took a subset of Wikipedia articles (2302, available at http://people.apache.org/~gsingers/wikipedia/chunks.tar.gz, created by the WikipediaXMLSplitter in the example directory). Next, I picked a topic of interest, in this case all docs containing the phrase "Abraham Lincoln", and I made the assumption that there are 10 users out of a total of 1000 who are "Lincolnphiles" and have thereby rated most of the articles (17 total) on the topic.

The ratings range between -5 and 5 (as doubles), and for the most part the Lincolnphiles tend to like the same things, but to varying degrees. (Note, I did these ratings by hand and thus "stacked the deck".) The Lincolnphiles are really obsessed and did not rate any other documents.
However, not all of them rated all 17 articles.

Next, I assumed the other 990 users are randomly rating across all the documents and in the same range. Thus, for every article in the set, I randomly grabbed X users and then had them randomly assign a degree of like or dislike in the range mentioned.

I then implemented a basic recommender according to the Taste docs, under the User-based recommenders section. I then pass in the user id of one of the Lincolnphiles. The results I get back are a bit surprising in that none of the recommendations are for other items rated highly by the Lincolnphiles, despite the fact that, when setting the neighborhood to be 10, all of the other Lincolnphiles are in the neighborhood plus one non-Lincolnphile. I would expect the recommendations to be for items that are not rated by my Lincolnphile, but have been rated by the other Lincolnphiles, or at least some of them, but in fact none of the recommendations are for Lincoln docs.

OK, so I then played around a bit with the neighborhood size. If I make it 9 (which is the number of other Lincolnphiles in the system) or less, I then get what I expected. So, it seems the one non-Lincolnphile rated a lot more items than all the Lincolnphiles. Is that why that user's items seem to dominate the recommendations? In looking at the non-Lincoln user, I see two items that that user and my Lincolnphile both rated: one that they both really liked and one that they disagreed on.

I'm not exactly sure what my questions are, other than the one about an active user dominating like-minded but less active raters, and what the appropriate thing to do there is, if anything, but I wanted to make sure this all makes sense. Also, is there any notion in Taste similar to Lucene's explain method ( http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/search/Searcher.html#explain(org.apache.lucene.search.Query,%20int) )?
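One way to picture how a single active neighbor could dominate is the toy calculation below. It assumes the estimate for an item is a plain similarity-weighted average over the neighbors who rated it, with positive similarities. That is an assumption made for illustration, not a statement about Taste's internals, and the class name, method name, and numbers are all hypothetical.

```java
// Simplified sketch of a user-based estimate: a similarity-weighted average of
// the ratings that neighbors gave one item. Hypothetical names; not Taste's code.
class WeightedEstimateSketch {

    // similarities[i] is neighbor i's similarity to the target user (assumed > 0).
    // ratings[i] is neighbor i's rating of the item, or NaN if they did not rate it.
    // Returns NaN when no neighbor rated the item.
    static double estimate(double[] similarities, double[] ratings) {
        double weightedSum = 0.0;
        double totalSimilarity = 0.0;
        for (int i = 0; i < similarities.length; i++) {
            if (Double.isNaN(ratings[i])) {
                continue; // this neighbor did not rate the item
            }
            weightedSum += similarities[i] * ratings[i];
            totalSimilarity += similarities[i];
        }
        return totalSimilarity == 0.0 ? Double.NaN : weightedSum / totalSimilarity;
    }

    public static void main(String[] args) {
        final double NR = Double.NaN; // "not rated"

        // Made-up neighborhood of 10: nine Lincolnphiles plus one casual user.
        double[] sims = {0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.2};

        // A Lincoln article that two of the 'philes rated.
        double[] lincolnDoc = {3.0, 4.0, NR, NR, NR, NR, NR, NR, NR, NR};

        // An unrelated article that only the casual user rated.
        double[] otherDoc = {NR, NR, NR, NR, NR, NR, NR, NR, NR, 5.0};

        System.out.println("Lincoln doc estimate: " + estimate(sims, lincolnDoc));
        System.out.println("Other doc estimate:   " + estimate(sims, otherDoc));
    }
}
```

Under that assumption, an item rated by only one neighbor gets that neighbor's rating as its estimate, because the similarity cancels out of the average. With the made-up numbers here the unrelated article comes out at 5.0 and the Lincoln article at 3.5 (up to floating-point rounding), so the casual user's item ranks first despite the much lower similarity.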
After this sanity check, my next goal is to show how a new Lincolnphile coming into the system would be guided to other content on Lincoln. [And yes, once done, this code will be publicly available, but it will be a little while]

Here's my snippet of code for recommending, pretty much verbatim from the Taste docs:

    UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);
    // Optional:
    userSimilarity.setPreferenceInferrer(new AveragingPreferenceInferrer(dataModel));
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(neighSize, userSimilarity, dataModel);
    Collection<User> users = neighborhood.getUserNeighborhood(userId);
    for (User neighbor : users) {
      System.out.println("Neighbor: " + neighbor);
    }
    Recommender recommender =
        new GenericUserBasedRecommender(dataModel, neighborhood, userSimilarity);
    Recommender cachingRecommender = new CachingRecommender(recommender);
    List<RecommendedItem> recommendations = cachingRecommender.recommend(userId, 10);
    System.out.println("Recommendations:");
    for (RecommendedItem item : recommendations) {
      Item theItem = item.getItem();
      String title = idsToTitle.get(theItem.getID().toString());
      System.out.println("Doc Id: " + theItem + " Title: " + title);
    }

Cheers,
Grant
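The truncation idea from the reply at the top of this thread (chop off recs whose estimated pref is below a certain value, or cut the list where the estimate drops sharply from one rec to the next) could be sketched roughly as follows. This is a minimal, standalone sketch over plain doubles rather than Taste's RecommendedItem; the class name, the method name, and both thresholds are hypothetical and would need tuning against real data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical post-filter for a ranked list of estimated preferences.
class RecTruncationSketch {

    // Keeps the leading estimates of a list sorted in descending order.
    // Stops at the first estimate below minEstimate, or at the first estimate
    // that is more than maxDrop lower than the one kept just before it.
    static List<Double> truncate(List<Double> sortedEstimates,
                                 double minEstimate, double maxDrop) {
        List<Double> kept = new ArrayList<>();
        for (double estimate : sortedEstimates) {
            if (estimate < minEstimate) {
                break; // below the absolute cutoff
            }
            if (!kept.isEmpty() && kept.get(kept.size() - 1) - estimate > maxDrop) {
                break; // large gap between consecutive estimates
            }
            kept.add(estimate);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up estimates: one strong rec followed by much weaker ones.
        List<Double> estimates = Arrays.asList(4.8, 2.1, 2.0, 1.9, 0.4);

        // Cutoff of 1.0 and a maximum allowed drop of 1.5 between neighbors.
        System.out.println(truncate(estimates, 1.0, 1.5));

        // Same cutoff, but with the gap check effectively disabled.
        System.out.println(truncate(estimates, 1.0, Double.POSITIVE_INFINITY));
    }
}
```

With the gap check on, only the first estimate survives because the drop from 4.8 to 2.1 exceeds 1.5. With it disabled, the list is cut only where the estimate falls below 1.0.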
