Hi all,

I've been using Mahout for many years now, mainly for my Master's thesis, and now for my PhD thesis. So, first of all, I want to congratulate you on the effort of releasing such a library as open source.

At this point, my main concern is recommendation, and because of that I have been using the different recommenders, evaluators and similarities implemented in the library. However, today, after inspecting your code many times, I have found what is, IMHO, a relevant bug with further implications.

It concerns the computation of the similarity. Although it is not the only similarity implemented, Pearson's correlation is one of the most popular ones. This similarity requires normalising (or "centering") the data using each user's mean, so that a user who usually rates items with 5's can be distinguished from a user who usually rates them with 3's, even when both gave a particular item a 5. The problem is that the users' means are being calculated using ONLY the items the two users have in common, leading to strange similarity values (or worse, to no similarity at all!). It is not difficult to find small examples showing this behaviour. Besides, seminal papers assume the overall mean rating is used [1, 2].
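To make this concrete, here is a minimal sketch in plain Java. It does not use Mahout's classes: the class name, the method names and the ratings are all made up for illustration, and it only shows the effect of the two choices of mean, not the actual code in the library. Both users gave a 5 to the two items they have in common, but they rated their other items differently.

```java
import java.util.Map;

// Hypothetical illustration; none of these names come from Mahout.
class PearsonMeanSketch {

    // Mean over every rating the user has given.
    static double overallMean(Map<String, Double> ratings) {
        double sum = 0.0;
        for (double r : ratings.values()) {
            sum += r;
        }
        return sum / ratings.size();
    }

    // Mean over only the items that the other user has rated too.
    static double coRatedMean(Map<String, Double> ratings, Map<String, Double> other) {
        double sum = 0.0;
        int n = 0;
        for (Map.Entry<String, Double> e : ratings.entrySet()) {
            if (other.containsKey(e.getKey())) {
                sum += e.getValue();
                n++;
            }
        }
        return sum / n; // NaN when the users share no items
    }

    // Pearson's correlation over the co-rated items, centred on the given means.
    static double pearson(Map<String, Double> a, Map<String, Double> b,
                          double meanA, double meanB) {
        double num = 0.0, sqA = 0.0, sqB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double rb = b.get(e.getKey());
            if (rb == null) {
                continue; // item not rated by both users
            }
            double da = e.getValue() - meanA;
            double db = rb - meanB;
            num += da * db;
            sqA += da * da;
            sqB += db * db;
        }
        return num / Math.sqrt(sqA * sqB); // NaN when a centred vector is all zeros
    }

    // Made-up ratings: the users share i1 and i2 only.
    static Map<String, Double> userA() {
        return Map.of("i1", 5.0, "i2", 5.0, "i3", 1.0, "i4", 1.0); // overall mean 3
    }

    static Map<String, Double> userB() {
        return Map.of("i1", 5.0, "i2", 5.0, "i5", 3.0, "i6", 3.0); // overall mean 4
    }

    public static void main(String[] args) {
        Map<String, Double> a = userA();
        Map<String, Double> b = userB();
        // Means over the common items are both 5, so every centred value is 0.
        System.out.println(pearson(a, b, coRatedMean(a, b), coRatedMean(b, a))); // NaN
        // Overall means keep the information that a 5 is above average for both users.
        System.out.println(pearson(a, b, overallMean(a), overallMean(b))); // 1.0
    }
}
```

With the means taken over the common items only, the centred ratings vanish and no similarity can be computed. With the overall means, the same pair of users gets a similarity of 1.0.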

Since I am new to the patch and bug-fix process, I would like to know the best (or the only?) way of contributing this finding. I should say that I have already fixed the code (the change affects the AbstractSimilarity class and would therefore have an impact on other similarity functions too).

Best regards,
Alejandro

[1] M. J. Pazzani: "A framework for collaborative, content-based and demographic filtering". Artificial Intelligence Review 13, pp. 393-408. 1999
[2] C. Desrosiers, G. Karypis: "A comprehensive survey of neighborhood-based recommendation methods". Recommender Systems Handbook, chapter 4. 2010

--
 Alejandro Bellogin Kouki
 http://rincon.uam.es/dir?cw=435275268554687
