As a small follow-up on this, here's a small result that should hold: setting the sampling rate to 1/X (i.e. if you set it to 20%, X = 5) should, of course, reduce the time spent finding a neighborhood by a factor of X. Assuming users are fairly evenly scattered around your rating-space, the average distance to the users in your computed neighborhood also increases by roughly a factor of X: keep only 1 in X users, and the typical spacing between the remaining users grows by about that same factor.
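
To make that speed/quality knob concrete, here is a minimal sketch of sampled neighborhood formation. This is plain standalone Java, not the Taste API; the class name, the dense double[] rating vectors, and the choice of Euclidean distance are all simplifying assumptions for illustration:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Random;

/**
 * Illustrative only: finds the k nearest neighbors of a target user
 * while examining each candidate with probability samplingRate (1/X).
 * Users are modeled as dense rating vectors; a real recommender would
 * use sparse preference data and a pluggable similarity measure.
 */
public class SampledNeighborhood {

  private static final Random RANDOM = new Random();

  /** Plain Euclidean distance in rating-space (an assumed metric). */
  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  /**
   * Returns the IDs of the k nearest sampled candidates. With
   * samplingRate = 1/X, only about a 1/X fraction of candidates is
   * examined, so the scan costs roughly 1/X of a full scan.
   */
  static List<String> findNeighborhood(double[] target,
                                       Map<String, double[]> candidates,
                                       int k,
                                       double samplingRate) {
    List<Map.Entry<String, double[]>> sampled = new ArrayList<>();
    for (Map.Entry<String, double[]> e : candidates.entrySet()) {
      // Skip ~ (1 - 1/X) of users before any distance is computed.
      if (RANDOM.nextDouble() < samplingRate) {
        sampled.add(e);
      }
    }
    sampled.sort(Comparator.comparingDouble(
        (Map.Entry<String, double[]> e) -> distance(target, e.getValue())));
    List<String> neighborhood = new ArrayList<>();
    for (int i = 0; i < Math.min(k, sampled.size()); i++) {
      neighborhood.add(sampled.get(i).getKey());
    }
    return neighborhood;
  }
}

With samplingRate = 0.2 (X = 5), the loop touches about a fifth of the candidates, which is where the factor-of-X speedup comes from; the cost is that the k survivors are drawn from a 5x sparser pool, hence the larger average distance.
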
So you get results X times faster, but the results you get are X times 'worse'. This sounds bad, but consider that users 5 times farther away in your rating-space may still be suitable neighbors and yield the same recommendations.

On Fri, May 1, 2009 at 8:32 AM, Sean Owen <[email protected]> wrote:
> It really depends on the nature of the data and what tradeoff you want
> to make. I have not studied this in detail. Anecdotally, on a
> large-ish data set you can ignore most users and still end up with an
> OK neighborhood.
>
> Actually I should do a bit of math to get an analytical result on
> this, let me do that.
