Hi all,
I've been using Mahout for many years now, first for my Master's
thesis and now for my PhD thesis. So, first of all, I want to
congratulate you on the effort of releasing such a library as open source.
At this point, my main interest is recommendation, and, because of that,
I have been using the different recommenders, evaluators and
similarities implemented in the library. However, today, after
inspecting your code many times, I have found what is, IMHO, a
significant bug with further implications.
It is related to the computation of the similarity. Although it is
not the only similarity implemented, Pearson's correlation is one of the
most popular ones. This similarity requires normalising (or "centering")
the data using each user's mean rating, in order to distinguish a
user who usually rates items with 5's from a user who usually rates them
with 3's, even though both rated a particular item with a 5. The
problem is that the users' means are being calculated using ONLY the
items in common between the two users, leading to strange similarity
values (or worse, to no similarity at all!). For instance, if a user
gave the same rating to every item in common, all of that user's
centered ratings are zero and the correlation is undefined. It is not
difficult to find small examples showing this behaviour. Besides,
seminal papers assume the overall mean rating is used [1, 2].
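To make the difference concrete, here is a minimal, self-contained Java
sketch. It is NOT Mahout's actual code: the class PearsonMeansSketch, its
method names and the ratings are all made up for illustration. It
computes Pearson's correlation over the co-rated items in two ways, once
centering with means taken over the co-rated items only, and once
centering with each user's overall mean, which is how I read [1, 2].

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only; hypothetical names, not part of Mahout.
public class PearsonMeansSketch {

    /** Mean of the user's ratings restricted to the given items. */
    public static double mean(Map<String, Double> ratings, Iterable<String> items) {
        double sum = 0.0;
        int n = 0;
        for (String item : items) {
            sum += ratings.get(item);
            n++;
        }
        return sum / n;
    }

    /**
     * Pearson's correlation summed over the co-rated items.
     * overallMeans = false: center with means over the co-rated items only.
     * overallMeans = true:  center with each user's mean over all ratings.
     * Returns NaN when there is no overlap or a denominator is zero.
     */
    public static double pearson(Map<String, Double> a, Map<String, Double> b,
                                 boolean overallMeans) {
        Set<String> common = new TreeSet<>(a.keySet());
        common.retainAll(b.keySet());
        if (common.isEmpty()) {
            return Double.NaN;
        }
        double meanA = mean(a, overallMeans ? a.keySet() : common);
        double meanB = mean(b, overallMeans ? b.keySet() : common);
        double sumAB = 0.0, sumA2 = 0.0, sumB2 = 0.0;
        for (String item : common) {
            double da = a.get(item) - meanA;
            double db = b.get(item) - meanB;
            sumAB += da * db;
            sumA2 += da * da;
            sumB2 += db * db;
        }
        return sumAB / Math.sqrt(sumA2 * sumB2); // 0.0 / 0.0 yields NaN
    }

    public static void main(String[] args) {
        // Made-up ratings: both users love i1 and i2, dislike other items.
        Map<String, Double> a = new HashMap<>();
        a.put("i1", 5.0); a.put("i2", 5.0); a.put("i3", 1.0); a.put("i4", 1.0);
        Map<String, Double> b = new HashMap<>();
        b.put("i1", 5.0); b.put("i2", 5.0); b.put("i5", 1.0); b.put("i6", 1.0);

        System.out.println(pearson(a, b, false)); // prints NaN
        System.out.println(pearson(a, b, true));  // prints 1.0
    }
}
```

With the co-rated means, both users' centered ratings on i1 and i2 are
zero, so the result is 0/0 (no similarity at all). With the overall
means (3 for both users), the same two ratings are clearly above each
user's average and the correlation is 1.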
Since I am new to the patch and bug-fix process, I would like
to know what is the best (or the only?) way of contributing this finding. I
should say that I have already fixed the code (the fix affects the
AbstractSimilarity class and would therefore have an impact on
other similarity functions too).
Best regards,
Alejandro
[1] M. J. Pazzani: "A framework for collaborative, content-based and
demographic filtering". Artificial Intelligence Review 13, pp. 393-408. 1999
[2] C. Desrosiers, G. Karypis: "A comprehensive survey of
neighborhood-based recommendation methods". Recommender Systems
Handbook, chapter 4. 2010
--
Alejandro Bellogin Kouki
http://rincon.uam.es/dir?cw=435275268554687