What about defining a small prior similarity value between all items? In this case, the Lincoln book and the cookbook would start with some small similarity like 0.01, and as more users connect the books, this value gets swamped by the true value. It's the same concept that's used in Beta or Dirichlet distributions.
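As a rough illustration of that smoothing idea, here is a small sketch in Java; the prior value, its weight, and the blending formula are assumptions made up for the example, not anything taken from Mahout.

/*
 * Rough sketch only: blends a small prior similarity with the similarity
 * observed from co-rating data, weighting the data more heavily as the
 * number of co-rating users grows. The constants and the blending formula
 * are assumptions for illustration, not anything taken from Mahout.
 */
public final class SmoothedSimilarity {

  private static final double PRIOR_SIMILARITY = 0.01; // small prior between any two items
  private static final double PRIOR_WEIGHT = 5.0;      // strength of the prior, in pseudo-counts

  /**
   * @param observedSimilarity similarity computed from the data, or Double.NaN if undefined
   * @param numCoRatingUsers   how many users rated both items
   */
  static double smooth(double observedSimilarity, int numCoRatingUsers) {
    if (Double.isNaN(observedSimilarity) || numCoRatingUsers == 0) {
      return PRIOR_SIMILARITY; // no data at all: fall back to the prior alone
    }
    // With few co-rating users the prior dominates; with many, the observed value swamps it.
    return (PRIOR_WEIGHT * PRIOR_SIMILARITY + numCoRatingUsers * observedSimilarity)
        / (PRIOR_WEIGHT + numCoRatingUsers);
  }

  public static void main(String[] args) {
    System.out.println(smooth(0.1, 2));   // ~0.036: two co-rating users, the prior still dominates
    System.out.println(smooth(0.1, 200)); // ~0.098: plenty of data, close to the observed 0.1
  }
}

With only a couple of co-rating users the estimate stays near the prior, and only a substantial amount of data pulls it toward the observed similarity.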
Anyway, in the case of this algorithm, if there is no user data connecting the Lincoln books and the cookbook, then the resulting preference would just be the average of all the user's previous ratings. If there is some weak similarity, say 0.1 with a rating of 5, then you'd skew the resulting preference score higher, but it won't go all the way to 5.0. How much it skews is controlled by the strength of the prior relative to the similarity from the data.

-Mark

On Thu, Aug 20, 2009 at 11:20 AM, Sean Owen <[email protected]> wrote:

> To kind of follow up on this, since Grant and I talked at length about this today --
>
> Yeah, in the end he was able to create some basic examples that gave reasonable results on this data set, except for approaches based on item-based recommenders. This data set highlighted a weakness of a straightforward reading of the algorithm, and it might be interesting to explain it here for any interested parties.
>
> In this data set it's clear what the 'right' answers are -- there is a small group of fans of books about Lincoln, and one expects more Lincoln books to be recommended. Indeed, the algorithms estimated high ratings for such books on a scale of 1 to 5. But a number of seemingly unrelated books were topping the recommendation list with an estimated rating of 5.
>
> Why? Well, the core of the item-based recommender is the bit that estimates a probable rating for some new item under consideration:
>
>   estimatePreference for item:
>     for each preferredItem that the user expresses a preference for:
>       compute similarity metric m between item and preferredItem
>       if it is defined:
>         store the preference value for preferredItem, weighted by (multiplied by) similarity m
>     return the weighted average over those stored preference values
>
> Simple enough. But it's vulnerable to a not-uncommon scenario. Imagine a user who likes only Lincoln-related books. In the course of producing recommendations, we estimate a preference value for, say, a cookbook. It should be low, right? This is a user who rates Lincoln books just about exclusively, and maybe rated a recent popular Lincoln biography 5 out of 5.
>
> But because there are few users in the world who both cook and study history, it's possible that, given the data, there is no definable similarity value between any of the Lincoln books this user likes and the cookbook. Fine; in this case the cookbook won't be recommended.
>
> But imagine that, as it happens, a couple of people like both the cookbook and this recent Lincoln biography. We compute the similarity and still find it's low -- maybe 0.1. Still seems like this isn't a good recommendation, right?
>
> Well, if that's the only Lincoln book in the preferences that has any defined similarity to the cookbook, the above algorithm says we should return the weighted average of those preferences, weighted by similarity. That's (0.1 * 5.0) / 0.1, or 5.0. The process says the estimated rating is 5, and indeed there is some justification for that. But it flies in the face of intuition.
>
> Interesting. Takeaways?
>
> 1) The stock item-based recommender just has this weakness. I am open to suggestions about alternate modes that could work around it. For example: reject as a recommendation any item like this, where the estimate would be based on a single data point.
>
> 2) Once again it seems that using preference values can actually get one into trouble. Interesting.
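To make the arithmetic above concrete, here is a minimal, self-contained sketch of that weighted-average estimate; the class, method, and item names are invented for illustration, and this is not the actual Mahout implementation.

import java.util.Map;

/*
 * Sketch of the weighted-average estimate described above. The class,
 * method, and item names are invented for illustration; this is not the
 * actual Mahout implementation.
 */
public final class ItemBasedEstimateSketch {

  /**
   * @param userPrefs    the user's known ratings, item -> rating
   * @param similarities similarity between the candidate item and each rated item
   *                     (items with no defined similarity are simply absent)
   * @return the estimated rating, or Double.NaN if no similarity is defined
   */
  static double estimatePreference(Map<String, Double> userPrefs, Map<String, Double> similarities) {
    double weightedSum = 0.0;
    double totalWeight = 0.0;
    for (Map.Entry<String, Double> pref : userPrefs.entrySet()) {
      Double sim = similarities.get(pref.getKey());
      if (sim != null) {
        weightedSum += sim * pref.getValue();
        totalWeight += sim;
      }
    }
    return totalWeight == 0.0 ? Double.NaN : weightedSum / totalWeight;
  }

  public static void main(String[] args) {
    // A Lincoln fan; the cookbook has a defined (but weak) similarity to only one rated book.
    Map<String, Double> prefs = Map.of(
        "Lincoln Biography", 5.0,
        "Team of Rivals", 4.0,
        "Lincoln Speeches", 5.0);
    Map<String, Double> cookbookSimilarities = Map.of("Lincoln Biography", 0.1);

    // Prints 5.0: (0.1 * 5.0) / 0.1 -- a single weak data point yields a top rating.
    System.out.println(estimatePreference(prefs, cookbookSimilarities));
  }
}

A workaround in the spirit of takeaway (1) would be to return NaN, or fall back to the user's average rating, whenever the weighted average rests on a single defined similarity.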
> On Thu, Jun 18, 2009 at 9:43 PM, Ted Dunning <[email protected]> wrote:
> > Grant,
> >
> > The data you described is pretty simple and should produce good results at all levels of overlap. That it does not is definitely a problem. In fact, I would recommend making the data harder to deal with by giving non-Lincoln items highly variable popularities and then making the groundlings rate items according to their popularity. This will result in an apparent pattern where the inclusion of any number of non-Lincoln fans makes the group appear to like popular items. The correct inference should, however, be that any neighbor group that has a large number of Lincoln fans seems to like popular items less than expected.
> >
> > For problems like this, I have had good luck with using measures that were robust in the face of noise (aka small counts) and in the face of large external trends (aka the top-40 problem). The simplest one that I know of is the generalized multinomial log-likelihood ratio <http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html> that you hear me nattering about so often. LSI does a decent job of dealing with the top-40 problem, but has trouble with small counts. LDA and related probabilistic methods should work somewhat better than the log-likelihood ratio, but are considerably more complex to implement.
> >
> > The key here is to compare counts within the local neighborhood to counts outside the neighborhood. Things that are significantly different about the neighborhood relative to the rest of the world are candidates for recommendation. Things to avoid when looking for interesting differences include:
> >
> > - correlation measures such as Pearson's R (based on a normal-distribution approximation and unscaled, so it suffers from both the small-count and top-40 problems)
> >
> > - anomaly measures based simply on frequency ratios (very sensitive to small-count problems, and don't account for top-40 at all)
> >
> > What I would recommend for a nearest-neighbor approach is to continue with the current neighbor retrieval, but switch to a log-likelihood ratio for generating recommendations.
> >
> > What I would recommend for a production system would be to scrap the nearest-neighbor approach entirely and go to a cooccurrence-matrix-based approach. This costs much less to compute at recommendation time and is very robust against both small-count and top-40 issues.
> >
> > On Thu, Jun 18, 2009 at 9:37 AM, Sean Owen <[email protected]> wrote:
> >
> >> Still seems a little funny. I am away from the code, otherwise I would check -- I forget if I ever implemented weighting in the standard correlation similarity metrics. A pure correlation does not account for whether you overlapped in 10 or 1000 items. This sort of weighting exists elsewhere, but I forget whether it does here. It might help.
> >>
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
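For reference, the log-likelihood ratio Ted mentions can be computed from a 2x2 table of cooccurrence counts. The sketch below follows the entropy formulation in the blog post linked above; the counts used in main() are made up for illustration.

/*
 * Sketch of the log-likelihood ratio score for a 2x2 cooccurrence table,
 * following the entropy formulation in the blog post linked above. The
 * counts used in main() are made up for illustration.
 */
public final class LlrSketch {

  private static double xLogX(long x) {
    return x == 0L ? 0.0 : x * Math.log(x);
  }

  // Unnormalized Shannon entropy of a set of counts.
  private static double entropy(long... counts) {
    long sum = 0L;
    double sumXLogX = 0.0;
    for (long count : counts) {
      sum += count;
      sumXLogX += xLogX(count);
    }
    return xLogX(sum) - sumXLogX;
  }

  /**
   * @param k11 users who interacted with both items
   * @param k12 users who interacted with item A but not item B
   * @param k21 users who interacted with item B but not item A
   * @param k22 users who interacted with neither item
   */
  static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    // Guard against tiny negative values from floating-point round-off.
    return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
  }

  public static void main(String[] args) {
    // Only two co-raters in a population of 1000: a modest score.
    System.out.println(logLikelihoodRatio(2, 8, 20, 970));
    // Many co-raters: a much larger score, i.e. much stronger evidence of association.
    System.out.println(logLikelihoodRatio(100, 50, 60, 790));
  }
}

Because the score grows with the amount of evidence, a pair supported by only a couple of co-raters scores far lower than one supported by many, which is the robustness to small counts described above.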
