Alejandro,

The difficulty is that values that would normally be zero are in fact Double.NaN. Including these extra values to get a "correct" result invariably means ending up with Double.NaN as the result. To avoid this, Mahout uses non-standard implementations that only consider co-occurring entries in the computation. Whether these distance metrics should carry the same names as their non-recommender cousins is open to debate...
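A minimal standalone sketch (not Mahout's actual implementation) of the point: once missing ratings are represented as Double.NaN, any computation that touches them returns NaN, so the similarity code has to restrict itself to entries present in both vectors.

```java
// Sketch only: why NaN-valued "missing" entries force similarity code
// to restrict itself to co-occurring entries. Not Mahout's real code.
public class NanDotProduct {

    // Naive dot product over full vectors: a single NaN entry poisons
    // the whole result, because NaN * x = NaN and NaN + x = NaN.
    static double naiveDot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Co-occurrence-only variant: skip any entry that is missing (NaN)
    // in either vector, which is the approach described above.
    static double coOccurringDot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            if (!Double.isNaN(a[i]) && !Double.isNaN(b[i])) {
                sum += a[i] * b[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // The two users from the example below: -- becomes Double.NaN.
        double[] u1 = {4, 4, Double.NaN, 5};
        double[] u2 = {3, 3, 5, Double.NaN};
        System.out.println(naiveDot(u1, u2));       // NaN
        System.out.println(coOccurringDot(u1, u2)); // 24.0 (4*3 + 4*3)
    }
}
```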
Daniel

On Wed, Apr 6, 2011 at 12:00 PM, Alejandro Bellogin Kouki
<[email protected]> wrote:
> Hi,
>
> maybe I didn't express myself correctly... I'm talking about the calculation
> of the user's or item's mean (R_i in Sarwar's paper), which should be
> computed using ALL the items of that user/item, BUT in Mahout it is computed
> using only the items co-rated by both users/items.
>
> This causes strange effects. For instance, if we have two users with two
> items in common, and other unknown ratings:
>
>       i1  i2  i3  i4
>   u1   4   4  --   5
>   u2   3   3   5  --
>
> the current code in Mahout computes the mean of u1 as 4, and of u2 as 3,
> which leads to a 0 when it is used for centering the data, instead of 4.33
> and 3.67, respectively.
>
> I hope it is clearer now.
>
> Alejandro
>
> Sebastian Schelter wrote:
>>
>> IIRC Sarwar et al.'s "Item-Based Collaborative Filtering Recommendation
>> Algorithms" explicitly mentions using only the co-rated cases for Pearson
>> correlation.
>>
>> --sebastian
>>
>> On 06.04.2011 17:33, Sean Owen wrote:
>>>
>>> It's a good question.
>>>
>>> The Pearson correlation of two series does not change if the series
>>> means change. That is, subtracting the same value from all elements of
>>> one series (or scaling the values) doesn't change the correlation. In
>>> that sense, I would not say you must center the series to make either
>>> one's mean 0. It wouldn't make a difference, no matter what number you
>>> subtracted, even if it were the mean of all ratings by the user.
>>>
>>> The code you see in the project *does* center the data, because *if*
>>> the means are 0, then the computation result is the same as the cosine
>>> measure, and that seems nice. (There's also an uncentered cosine
>>> measure version.)
>>>
>>> What I think you're really getting at is: can't we expand the series
>>> to include all items that either one or the other user rated? Then the
>>> question is, what are the missing values you want to fill in?
>>> There's
>>> not a great answer to that, since any answer is artificial, but
>>> picking the user's mean rating is a decent choice. This is not quite
>>> the same as centering.
>>>
>>> You can do that in Mahout -- use AveragingPreferenceInferrer to do
>>> exactly this with these similarity metrics. It will slow things down,
>>> and anecdotally I don't think it's worth it, but it's certainly there.
>>>
>>> I don't think the normal version, without a PreferenceInferrer, is
>>> "wrong". It is just implementing the Pearson correlation on all the
>>> data available, and you have to add a setting to tell it to make up
>>> data.
>>>
>>> On Wed, Apr 6, 2011 at 3:13 PM, Alejandro Bellogin Kouki
>>> <[email protected]> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I've been using Mahout for many years now, mainly for my Master's
>>>> thesis, and now for my PhD thesis. That is why, first, I want to
>>>> congratulate you for the effort of releasing such a library as open
>>>> source.
>>>>
>>>> At this point, my main concern is recommendation, and, because of
>>>> that, I have been using the different recommenders, evaluators and
>>>> similarities implemented in the library. However, today, after
>>>> inspecting your code many times, I have found, IMHO, a relevant bug
>>>> with further implications.
>>>>
>>>> It is related to the computation of the similarity. Although it is
>>>> not the only similarity implemented, Pearson's correlation is one of
>>>> the most popular ones. This similarity requires normalising (or
>>>> "centering") the data using the user's mean, in order to be able to
>>>> distinguish a user who usually rates items with 5's from a user who
>>>> usually rates them with 3's, even though both rated a particular item
>>>> with a 5. The problem is that the users' means are being calculated
>>>> using ONLY the items in common between the two users, leading to
>>>> strange similarity computations (or worse, to no similarity at all!).
>>>> It is not difficult to find small examples showing this behaviour;
>>>> besides, seminal papers assume the overall mean rating is used [1, 2].
>>>>
>>>> Since I am new to this patch and bug/fix terminology, I would like to
>>>> know what is the best (or the only?) way of contributing this
>>>> finding. I have to say that I have already fixed the code (it affects
>>>> the AbstractSimilarity class, and therefore it would have an impact
>>>> on other similarity functions too).
>>>>
>>>> Best regards,
>>>> Alejandro
>>>>
>>>> [1] M. J. Pazzani: "A framework for collaborative, content-based and
>>>> demographic filtering". Artificial Intelligence Review 13,
>>>> pp. 393-408. 1999.
>>>> [2] C. Desrosiers, G. Karypis: "A comprehensive survey of
>>>> neighborhood-based recommendation methods". Recommender Systems
>>>> Handbook, chapter 4. 2010.
>>>>
>>>> --
>>>> Alejandro Bellogin Kouki
>>>> http://rincon.uam.es/dir?cw=435275268554687
>
> --
> Alejandro Bellogin Kouki
> http://rincon.uam.es/dir?cw=435275268554687
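The u1/u2 example from Alejandro's message can be checked numerically with a short standalone sketch (not Mahout code): the mean over only the co-rated items differs from the mean over all of a user's ratings, and centering with the co-rated mean zeroes out both rating vectors, which is exactly the "no similarity at all" case.

```java
// Sketch of the example in the thread: u1 rated i1=4, i2=4, i4=5 and
// u2 rated i1=3, i2=3, i3=5; only i1 and i2 are co-rated.
public class MeanCenteringExample {

    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    public static void main(String[] args) {
        double[] u1All = {4, 4, 5};      // all of u1's ratings
        double[] u2All = {3, 3, 5};      // all of u2's ratings
        double[] u1CoRated = {4, 4};     // u1 on the shared items i1, i2
        double[] u2CoRated = {3, 3};     // u2 on the shared items i1, i2

        System.out.println(mean(u1All));     // 4.333...
        System.out.println(mean(u2All));     // 3.666...
        // Means over co-rated items only: centering with these turns both
        // vectors into (0, 0), so Pearson becomes 0/0 = NaN.
        System.out.println(mean(u1CoRated)); // 4.0
        System.out.println(mean(u2CoRated)); // 3.0
    }
}
```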

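Sean's point upthread, that the Pearson correlation is unchanged when a constant is subtracted from (or added to) every element of one series, can also be verified with a small hypothetical helper (again, not Mahout's implementation):

```java
// Sketch: Pearson correlation is invariant to shifting one series by a
// constant, so the particular centering value does not matter once both
// series are defined over the same set of items.
public class PearsonShiftInvariance {

    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) {
            mx += x[i];
            my += y[i];
        }
        mx /= n;
        my /= n;
        double num = 0, dx = 0, dy = 0;
        for (int i = 0; i < n; i++) {
            num += (x[i] - mx) * (y[i] - my);
            dx += (x[i] - mx) * (x[i] - mx);
            dy += (y[i] - my) * (y[i] - my);
        }
        return num / Math.sqrt(dx * dy);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 5};
        double[] y = {2, 2, 4, 6};
        double[] xShifted = {101, 102, 103, 105}; // every element of x + 100
        System.out.println(pearson(x, y));
        System.out.println(pearson(xShifted, y)); // same value as above
    }
}
```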