No, I don't think it has anything to do with NaN. The result and the implementation are by design.
I really don't understand this talk of a "non-standard" Pearson
correlation. On the contrary, the implementation is quite strictly a
Pearson correlation.

The request seems to be to "fix" the computation to, say, compute a
Pearson correlation on series like (1,2) and (3,6,1,2). This isn't even
well-formed -- the series aren't of the same size.

It makes sense if you want to pad the series to be of equal size. That's
a good question. But it's not a question of how the Pearson correlation
is defined or implemented, but of how the data fed into it is "defined".
And I'm saying it's a valid variant, one that is implemented already.

On Wed, Apr 6, 2011 at 5:11 PM, Daniel McEnnis <[email protected]> wrote:
> Alejandro,
>
> The difficulty lies in that values that are normally zero are in fact
> Double.NaN. Including these extra values to get a correct result
> means, invariably, ending up with Double.NaN as a result. To avoid
> this, Mahout uses non-standard implementations that only consider
> co-occurrence entries in the result. Whether these distance metrics
> should be called the same as their non-recommender cousins is a
> question for debate....
>
> Daniel.
>
> On Wed, Apr 6, 2011 at 12:00 PM, Alejandro Bellogin Kouki
> <[email protected]> wrote:
>> Hi,
>>
>> maybe I didn't express myself correctly... I'm talking about the calculation
>> of the user's or item's mean (R_i in Sarwar's paper), which should be computed
>> using ALL the items of that user/item, BUT in Mahout it is computed using
>> only the items co-rated by both users/items.
>>
>> This causes strange effects, for instance, if we have two users with two
>> items in common, and other unknown ratings:
>>
>>       i1  i2  i3  i4
>> u1     4   4  --   5
>> u2     3   3   5  --
>>
>> the current code in Mahout computes the mean of u1 as 4, and of u2 as 3,
>> which leads to 0 for every centered rating, instead of using means of
>> about 4.3 and 3.7, respectively.
>>
>> I hope it is clearer now.
>>
>> Alejandro
>>
>> Sebastian Schelter wrote:
>>>
>>> IIRC Sarwar et al.'s "Item-Based Collaborative Filtering Recommendation
>>> Algorithms" explicitly mentions using only the co-rated cases for Pearson
>>> correlation.
>>>
>>> --sebastian
>>>
>>> On 06.04.2011 17:33, Sean Owen wrote:
>>>>
>>>> It's a good question.
>>>>
>>>> The Pearson correlation of two series does not change if the series
>>>> means change. That is, subtracting the same value from all elements of
>>>> one series (or scaling the values by a positive constant) doesn't
>>>> change the correlation. In that sense, I would not say you must center
>>>> the series to make either one's mean 0. It wouldn't make a difference,
>>>> no matter what number you subtracted, even if it were the mean of all
>>>> ratings by the user.
>>>>
>>>> The code you see in the project *does* center the data, because *if*
>>>> the means are 0, then the computation result is the same as the cosine
>>>> measure, and that seems nice. (There's also an uncentered cosine
>>>> measure version.)
>>>>
>>>> What I think you're really getting at is, can't we expand the series
>>>> to include all items that either one or the other user rated? Then the
>>>> question is, what are the missing values you want to fill in? There's
>>>> not a great answer to that, since any answer is artificial, but
>>>> picking the user's mean rating is a decent choice. This is not quite
>>>> the same as centering.
>>>>
>>>> You can do that in Mahout -- use AveragingPreferenceInferrer to do
>>>> exactly this with these similarity metrics. It will slow things down
>>>> and anecdotally I don't think it's worth it, but it's certainly there.
>>>>
>>>> I don't think the normal version, without a PreferenceInferrer, is
>>>> "wrong". It is just implementing the Pearson correlation on all data
>>>> available, and you have to add a setting to tell it to make up data.
>>>>
>>>> On Wed, Apr 6, 2011 at 3:13 PM, Alejandro Bellogin Kouki
>>>> <[email protected]> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I've been using Mahout for many years now, mainly for my Master's
>>>>> thesis, and now for my PhD thesis. That is why, first, I want to
>>>>> congratulate you for the effort of putting such a library out as
>>>>> open source.
>>>>>
>>>>> At this point, my main concern is recommendation, and, because of
>>>>> that, I have been using the different recommenders, evaluators and
>>>>> similarities implemented in the library. However, today, after
>>>>> inspecting your code many times, I have found, IMHO, a relevant bug
>>>>> with further implications.
>>>>>
>>>>> It is related to the computation of the similarity. Although it is
>>>>> not the only similarity implemented, Pearson's correlation is one of
>>>>> the most popular. This similarity requires normalising (or
>>>>> "centering") the data using the user's mean, in order to be able to
>>>>> distinguish a user who usually rates items with 5's from a user who
>>>>> usually rates them with 3's, even though both rated a particular
>>>>> item with a 5. The problem is that the user's means are being
>>>>> calculated using ONLY the items in common between the two users,
>>>>> leading to strange similarity computations (or worse, to no
>>>>> similarity at all!). It is not difficult to find small examples
>>>>> showing this behaviour; besides, seminal papers assume the overall
>>>>> mean rating is used [1, 2].
>>>>>
>>>>> Since I am a newbie on this patch and bug/fix terminology, I would
>>>>> like to know what is the best (or the only?) way of including this
>>>>> finding. I have to say that I have already fixed the code (it
>>>>> affects the AbstractSimilarity class, and therefore it would have
>>>>> an impact on other similarity functions too).
>>>>>
>>>>> Best regards,
>>>>> Alejandro
>>>>>
>>>>> [1] M. J. Pazzani: "A framework for collaborative, content-based and
>>>>> demographic filtering". Artificial Intelligence Review 13, pp. 393-408.
>>>>> 1999
>>>>> [2] C. Desrosiers, G. Karypis: "A comprehensive survey of
>>>>> neighborhood-based recommendation methods". Recommender Systems
>>>>> Handbook, chapter 4. 2010
>>>>>
>>>>> --
>>>>> Alejandro Bellogin Kouki
>>>>> http://rincon.uam.es/dir?cw=435275268554687
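
For reference, here is a minimal Java sketch of the u1/u2 example from this thread. It assumes nothing about Mahout's internals: the class PearsonSketch and its helpers (pearson, coRated, fillWithMean) are hypothetical names made up for illustration, and Double.NaN stands in for a missing rating. It computes a plain Pearson correlation over the co-rated items only, then over the series padded with each user's own mean rating (the padding idea discussed in the thread, not the AveragingPreferenceInferrer code itself), and finally checks that shifting one series by a constant leaves the correlation unchanged.

```java
import java.util.Arrays;

/** Illustrative sketch only; not Mahout's code. Double.NaN marks a missing rating. */
class PearsonSketch {

    /** Plain Pearson correlation of two equal-length series without missing values. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("series must have the same length");
        }
        double meanX = Arrays.stream(x).average().orElse(Double.NaN);
        double meanY = Arrays.stream(y).average().orElse(Double.NaN);
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        // 0.0 / 0.0 yields NaN when either series is constant.
        return sxy / Math.sqrt(sxx * syy);
    }

    /** Keeps only the positions that both users rated. */
    static double[][] coRated(double[] a, double[] b) {
        int n = 0;
        for (int i = 0; i < a.length; i++) {
            if (!Double.isNaN(a[i]) && !Double.isNaN(b[i])) n++;
        }
        double[][] out = new double[2][n];
        int k = 0;
        for (int i = 0; i < a.length; i++) {
            if (!Double.isNaN(a[i]) && !Double.isNaN(b[i])) {
                out[0][k] = a[i];
                out[1][k] = b[i];
                k++;
            }
        }
        return out;
    }

    /** Replaces each missing rating with the user's mean over their known ratings. */
    static double[] fillWithMean(double[] r) {
        double mean = Arrays.stream(r).filter(v -> !Double.isNaN(v)).average().orElse(Double.NaN);
        return Arrays.stream(r).map(v -> Double.isNaN(v) ? mean : v).toArray();
    }

    public static void main(String[] args) {
        // Ratings for items i1..i4 from the example in the thread.
        double[] u1 = {4, 4, Double.NaN, 5};
        double[] u2 = {3, 3, 5, Double.NaN};

        // Co-rated items only: (4,4) vs (3,3), both constant.
        double[][] co = coRated(u1, u2);
        System.out.println("co-rated only: " + pearson(co[0], co[1]));

        // Pad each series with that user's overall mean, then correlate.
        System.out.println("mean-padded:   " + pearson(fillWithMean(u1), fillWithMean(u2)));

        // Shifting one series by a constant does not change the correlation.
        double[] x = {1, 2, 3, 5};
        double[] y = {2, 1, 4, 3};
        double[] shifted = Arrays.stream(x).map(v -> v + 10).toArray();
        System.out.println("shift changes r by: " + Math.abs(pearson(x, y) - pearson(shifted, y)));
    }
}
```

On the co-rated items both series are constant, so the correlation is 0/0 and comes out as NaN. With mean padding, the two series in this particular example become linearly related and the value is approximately 1. The difference between the two results comes entirely from which data is fed in, not from the Pearson formula itself.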
