Off-line evaluation of recommendations is a really difficult problem. The problem is one of reject inference: your test data was sampled using one recommendation engine, which biases all of your data in favor of engines like that one. A very different engine might produce very different recommendations that would actually be much better. This is especially true in the case of binary input (the only important case for most applications).
The better evaluation method is to run multiple parametrized versions of the recommender and do an efficient parameter search to find which engines are better. The parameterization has to include the UI, because the UI can have such a huge effect. Unfortunately, this isn't feasible for wild new approaches where you have thousands or millions of potential engine configurations, but it is still the bread and butter of evaluation for real systems. I have found that simple inspection suffices for rough-cut evaluation, and automated multivariate testing is the way to judge fine-grained distinctions.

As a measure of how big an effect the UI can have, I recently had a system that gave 10 results per page. Here are some (unscaled) click rates by search rank:

  Rank   Click Rate
     0          853
     1          415
     2          238
     3          184
     4          170
     5          167
     6          133
     7          125
     8          121
     9          150
    10            0
    11            2
    12            2
    13            2
    14            2
    15            4
    16            2
    17            0
    18            0
    19            0
    20            3

The extraordinary thing about these results is that, apart from the first three or so ranks, the click rate is essentially constant down to the 10th result. Then it is clear that *nobody* clicks through to the next page. Based on this, I would expect as much as a 50% increase in total clicks just from presenting 20 results instead of 10. I have almost NEVER seen an algorithmic change that would make such a large difference.

On Tue, Jul 28, 2009 at 7:10 AM, Claudia Grieco <[email protected]> wrote:
> Thanks a lot :) I was wondering what those IR classes were for XD
>
> -----Original Message-----
> From: Sean Owen [mailto:[email protected]]
> Sent: Tuesday, July 28, 2009 3:52 PM
> To: [email protected]
> Subject: Re: The best evaluator for recommendations in binary data sets
>
> No, really, those types of evaluation do not apply to your case. They
> evaluate how closely the estimated preference values match the real ones.
> But in your case you have no preference values (or they're implicitly
> all '1.0' or something), so that comparison is meaningless.
>
> What you are likely interested in is something related but different:
> statistics like precision and recall. That is, you are concerned with
> whether the recommender recommends many of the items the user is actually
> associated with. For example, you might take away three of the user's
> items and see whether the recommender recommends those three back.
>
> Look at GenericRecommenderIRStatsEvaluator instead. It can compute
> precision and recall figures, which is more what you want.
>
> On Tue, Jul 28, 2009 at 2:35 PM, Claudia Grieco <[email protected]> wrote:
> > Hi guys,
> >
> > I have created a user-based recommender which operates on a binary data
> > set (a user has either bought or not bought a product).
> >
> > I'm using BooleanTanimotoCoefficient, BooleanUserGenericUserBased, and so
> > on.
> >
> > Is using AverageAbsoluteDifferenceRecommenderEvaluator to evaluate the
> > recommender a good idea?

--
Ted Dunning, CTO DeepDyve
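
A quick back-of-the-envelope check of the 50% figure above, as a sketch in plain Java: it assumes (this is an assumption, not measured data) that ranks 10-19, if shown on the same page, would draw clicks at roughly the flat tail rate of ranks 4-9 in the table.

// Back-of-the-envelope projection for showing 20 results instead of 10.
public class SecondPageEstimate {
  public static void main(String[] args) {
    // Observed (unscaled) click counts for ranks 0-9 from the table above.
    int[] firstPage = {853, 415, 238, 184, 170, 167, 133, 125, 121, 150};

    int firstPageTotal = 0;
    for (int clicks : firstPage) {
      firstPageTotal += clicks;
    }

    // Assumed per-rank rate for ranks 10-19: the mean of the flat tail
    // (ranks 4-9). This is an assumption rather than measured data.
    double tailRate = (170 + 167 + 133 + 125 + 121 + 150) / 6.0;
    double projectedExtra = 10 * tailRate;

    System.out.printf("first page total:       %d%n", firstPageTotal);    // 2556
    System.out.printf("projected extra clicks: %.0f%n", projectedExtra);  // ~1443
    System.out.printf("relative gain:          %.0f%%%n",
        100.0 * projectedExtra / firstPageTotal);                         // ~56%
  }
}

Under that assumption the second page adds roughly 1,400 clicks on top of about 2,550, i.e. on the order of a 50-60% gain, consistent with the estimate above.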
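
For the precision/recall evaluation Sean points to in the quoted thread, here is a minimal sketch against the Mahout Taste API. The class names follow later Mahout releases and may not match the Boolean* variants named above; the data file name, the neighborhood size of 25, and the "at 3" cutoff are illustrative assumptions, not anything from the thread.

import java.io.File;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class BooleanPrecisionRecall {
  public static void main(String[] args) throws Exception {
    // Boolean purchase data: lines like "userID,itemID" (file name is hypothetical).
    DataModel model = new FileDataModel(new File("purchases.csv"));

    // Build the same kind of recommender being evaluated: Tanimoto similarity
    // over boolean preferences with a user-based neighborhood.
    RecommenderBuilder builder = new RecommenderBuilder() {
      @Override
      public Recommender buildRecommender(DataModel dataModel) throws TasteException {
        UserSimilarity similarity = new TanimotoCoefficientSimilarity(dataModel);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(25, similarity, dataModel);
        return new GenericBooleanPrefUserBasedRecommender(dataModel, neighborhood, similarity);
      }
    };

    // Withhold some of each user's items, recommend, and score the overlap.
    RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
    IRStatistics stats = evaluator.evaluate(
        builder, null, model, null,
        3,                                                   // precision/recall "at 3"
        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, // let it pick a relevance threshold
        1.0);                                                // evaluate all users

    System.out.println("precision: " + stats.getPrecision());
    System.out.println("recall:    " + stats.getRecall());
  }
}

Roughly, the evaluator withholds each user's relevant items, builds the recommender on the remaining data, asks for top-N recommendations, and scores how many withheld items come back, which is the same take-some-away-and-see-what-returns idea described in the quoted message.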
