Well, mathematically, right now, by omitting the unknown similarities you are effectively setting them to zero, and so they drop out of the calculation. Unless there's some other calculation I don't know about...
If you want a more theoretically sound version of what I'm trying to do
(instead of my hacked-up way), see section 2.2.1 in:
http://research.microsoft.com/pubs/69656/tr-98-12.pdf

And yes, if you have a default similarity, you will change the result for
different entries. In the above example, say there is another book, maybe
about fishing, that we also want to estimate the preference for. Say this
book has a similarity of 0.1 with the Lincoln book and 0.1 with the France
book.

Without the prior, your estimate is:

score(fishing) = (0.1*5 + 0.1*3) / (0.1 + 0.1) = 4

With the prior, it's:

score(fishing) = (0.1*5 + 0.1*3 + 0.01*1) / (0.1 + 0.1 + 0.01) ≈ 3.86

This is about 96% of the original value, whereas for the cookbook example
it's 90%. They aren't moving in lockstep because the impact of the prior
differs depending on which entries we have real data for.

-Mark

On Fri, Aug 21, 2009 at 5:02 AM, Sean Owen <[email protected]> wrote:
> The only piece of what you're saying that sort of doesn't click with
> me is assuming a prior of 0, or some small value. Similarities range
> from -1 to 1, so a value of 0 is a positive statement that there is
> exactly no relationship between the ratings for the two items. This is
> different from having no opinion on it, and it matters to the
> subsequent calculations.
>
> But does your second suggestion really meaningfully change the result?
> Yes, I push the rating down from 5 to 4.5, but throwing in these other
> small terms in the average does roughly the same to all the values. I
> perhaps haven't thought this through. I understand the intuitions here
> and think they are right.
>
> One meta-issue here for the library is this: there's a 'standard'
> item-based recommender algorithm out there, and we want to have that.
> And we do. So I don't want to touch it -- perhaps add some options to
> modify its behavior. So we're talking about maybe inventing a variant
> algorithm... or three or four.
> That's good I guess, though not exactly
> the remit I had in mind for the CF part. I was imagining it would
> provide access to canonical approaches with perhaps some small
> variants tacked on, or at least hooks to modify parts of the logic.
>
> Basically I also need to have a think about how to include variants
> like this in a logical way.
>
> On Thu, Aug 20, 2009 at 6:07 PM, Mark Desnoyer <[email protected]> wrote:
> > You could do it that way, but I don't think you're restricted to ignoring
> > the rating values. For example, you could define the similarity between
> > item i and item j like this (the normalization is probably incomplete,
> > but this is the idea):
> >
> > similarity(i,j) = (prior + sum({x_ij})) / (count({x_ij}) + 1)
> >
> > where each x_ij is the similarity defined by a single user and could be
> > based on their ratings. So I think the way you're thinking of it,
> > x_ij = 1, but it could be a function of the ratings, say higher if the
> > ratings are closer and lower if they are far apart.
> >
> > You can still do the weighted average, you just have more items to
> > calculate. Say a user has rated the Lincoln book 5, a book on France 3,
> > and a book on space travel 1. Assuming there is no data linking the
> > France or space books to the cookbook, their similarities would be the
> > prior, or 0.01. Then, you'd calculate the score for the cookbook
> > recommendation as:
> >
> > score(cookbook) = (5*0.1 + 3*0.01 + 1*0.01) / (0.1 + 0.01 + 0.01) = 4.5
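[Editor's note] The weighted-average scores discussed in this thread are easy to check with a short script. This is an illustrative sketch only, not Mahout code: the function name `predict`, the item labels, and the 0.01 prior are taken from or modeled on the examples above.

```python
# Illustrative sketch of the weighted-average prediction from the thread:
# the predicted rating for a target item is the user's ratings weighted by
# item-item similarity. Item pairs with no co-rating data either drop out
# (default_sim=0, Sean's reading) or fall back to a small prior similarity
# (default_sim=0.01, Mark's suggestion).

def predict(ratings, similarities, default_sim=0.01):
    """ratings: {item: the user's rating}.
    similarities: {item: similarity to the target item}; items missing
    from this dict use default_sim instead of being ignored."""
    num = sum(similarities.get(item, default_sim) * r
              for item, r in ratings.items())
    den = sum(similarities.get(item, default_sim) for item in ratings)
    return num / den

ratings = {"lincoln": 5, "france": 3, "space": 1}

# Fishing book: similarity 0.1 to the Lincoln and France books, no data
# for the space book.
fishing_sims = {"lincoln": 0.1, "france": 0.1}
print(predict(ratings, fishing_sims, default_sim=0.0))  # 4.0 (prior omitted)
print(predict(ratings, fishing_sims))                   # ~3.86 (with prior)

# Cookbook: similarity data only for the Lincoln book.
print(predict(ratings, {"lincoln": 0.1}))               # 4.5 (with prior)
```

The two printed ratios (3.86/4 ≈ 96%, 4.5/5 = 90%) reproduce Mark's point that the prior's pull differs depending on how many entries have real data.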
