Bug in similarity computation

Alejandro Bellogin Kouki Wed, 06 Apr 2011 07:14:17 -0700

Hi all,

I've been using Mahout for many years now, first for my Master's thesis and now for my PhD thesis. So, first of all, I want to congratulate you on the effort of releasing such a library as open source.

At this point, my main concern is recommendation, and because of that I have been using the different recommenders, evaluators and similarities implemented in the library. However, today, after inspecting your code many times, I have found what is, IMHO, a relevant bug with further implications.

It is related to the computation of the similarity. Although it is not the only similarity implemented, Pearson's correlation is one of the most popular ones. This similarity requires normalising (or "centering") the data using each user's mean, in order to distinguish a user who usually rates items with 5's from a user who usually rates them with 3's, even though both rated a particular item with a 5. The problem is that the users' means are being calculated using ONLY the items in common between the two users, leading to strange similarity computations (or worse, to no similarity at all!). It is not difficult to find small examples showing this behaviour; besides, seminal papers assume the overall mean rating is used [1, 2].
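To make the difference concrete, here is a minimal sketch in Python (not Mahout's code; the function name `pearson`, its `use_overall_mean` flag and the toy ratings are all made up for illustration). It assumes one common formulation in which the sums run over the co-rated items and only the choice of mean changes.

```python
import math

def pearson(a, b, use_overall_mean):
    """Pearson correlation between two users' ratings (dicts: item -> rating).

    Hypothetical helper for illustration only. Returns None when the
    similarity is undefined (no common items, or zero variance in the
    centered ratings of either user).
    """
    common = sorted(set(a) & set(b))
    if not common:
        return None
    if use_overall_mean:
        # Mean over ALL items each user has rated.
        mean_a = sum(a.values()) / len(a)
        mean_b = sum(b.values()) / len(b)
    else:
        # Mean over the co-rated items only (the behaviour described above).
        mean_a = sum(a[i] for i in common) / len(common)
        mean_b = sum(b[i] for i in common) / len(common)
    dev_a = [a[i] - mean_a for i in common]
    dev_b = [b[i] - mean_b for i in common]
    num = sum(x * y for x, y in zip(dev_a, dev_b))
    den = math.sqrt(sum(x * x for x in dev_a) * sum(y * y for y in dev_b))
    return num / den if den > 0 else None

# Toy data: both users gave a 5 to the two items they have in common.
user_a = {"i1": 5, "i2": 5, "i3": 1, "i4": 1}  # overall mean 3
user_b = {"i1": 5, "i2": 5, "i5": 3, "i6": 3}  # overall mean 4

print(pearson(user_a, user_b, use_overall_mean=False))  # None
print(pearson(user_a, user_b, use_overall_mean=True))   # 1.0
```

With the means taken over the co-rated items only, every centered rating is zero, so the correlation is undefined (the "no similarity at all" case). With the overall means, both users are above their own average on the shared items and the similarity is 1.0.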

Since I am a newbie to this patch and bug/fix terminology, I would like to know what is the best (or the only?) way of contributing this finding. I should say that I have already fixed the code (it affects the AbstractSimilarity class, and therefore it would have an impact on other similarity functions too).


Best regards,
Alejandro

[1] M. J. Pazzani: "A framework for collaborative, content-based and demographic filtering". Artificial Intelligence Review 13, pp. 393-408. 1999
[2] C. Desrosiers, G. Karypis: "A comprehensive survey of neighborhood-based recommendation methods". Recommender Systems Handbook, chapter 4. 2010


--
 Alejandro Bellogin Kouki
 http://rincon.uam.es/dir?cw=435275268554687
