Re: Bug in similarity computation

Grant Ingersoll Wed, 06 Apr 2011 07:33:30 -0700

Hi Alejandro,

I won't comment on the issue itself (I am sure Sean and others will), since I 
haven't looked at the code, but 
https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Contribute describes 
how to submit a patch.  File a ticket in JIRA and provide the patch along with 
your test cases.


-Grant

On Apr 6, 2011, at 10:13 AM, Alejandro Bellogin Kouki wrote:

> Hi all,
> 
> I've been using Mahout for many years now, mainly for my Master's thesis, and 
> now for my PhD thesis. That is why, first, I want to congratulate you on the 
> effort of releasing such a library as open source.
> 
> At this point, my main concern is recommendation, and, because of that, I 
> have been using the different recommenders, evaluators and similarities 
> implemented in the library. However, today, after inspecting your code many 
> times, I have found what is, IMHO, a relevant bug with further implications.
> 
> It is related to the computation of the similarity. Although it is not 
> the only similarity implemented, Pearson's correlation is one of the most 
> popular ones. This similarity requires normalising (or "centering") the data 
> using the user's mean, in order to distinguish a user who usually 
> rates items with 5's from a user who usually rates them with 3's, even though 
> both rated a particular item with a 5. The problem is that the users' 
> means are being calculated using ONLY the items in common between the two 
> users, leading to strange similarity computations (or worse, to no similarity 
> at all!). It is not difficult to find small examples showing this behaviour; 
> besides, seminal papers assume the overall mean rating is used [1, 2].
> 
> Since I am a newbie to this patch and bug/fix terminology, I would like to 
> know what is the best (or the only?) way of contributing this finding. I have to 
> say that I have already fixed the code (it affects the AbstractSimilarity 
> class and would therefore have an impact on other similarity functions 
> too).
> 
> Best regards,
> Alejandro
> 
> [1] M. J. Pazzani: "A framework for collaborative, content-based and 
> demographic filtering". Artificial Intelligence Review 13, pp. 393-408. 1999
> [2] C. Desrosiers, G. Karypis: "A comprehensive survey of neighborhood-based 
> recommendation methods". Recommender Systems Handbook, chapter 4. 2010
> 
> -- 
> Alejandro Bellogin Kouki
> http://rincon.uam.es/dir?cw=435275268554687
> 

--------------------------
Grant Ingersoll
Lucene Revolution -- Lucene and Solr User Conference
May 25-26 in San Francisco
www.lucenerevolution.org
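
For readers following the thread, the behaviour Alejandro describes can be illustrated with a small Python sketch. This is not Mahout's code and not his patch: the function name `pearson`, the flag `use_overall_mean`, and the two toy users are assumptions made up for this example. It only contrasts centering on the mean of the co-rated items with centering on each user's overall mean, which is the variant the message above attributes to [1, 2].

```python
import math


def pearson(ratings_u, ratings_v, use_overall_mean):
    """Pearson-style similarity summed over the items both users rated.

    ratings_u, ratings_v: dicts mapping item id -> rating.
    use_overall_mean: if True, center each user on the mean of ALL their
    ratings; if False, center on the mean of the co-rated items only.
    Returns None when the similarity is undefined (zero denominator).
    """
    common = sorted(set(ratings_u) & set(ratings_v))
    if not common:
        return None

    if use_overall_mean:
        mean_u = sum(ratings_u.values()) / len(ratings_u)
        mean_v = sum(ratings_v.values()) / len(ratings_v)
    else:
        mean_u = sum(ratings_u[i] for i in common) / len(common)
        mean_v = sum(ratings_v[i] for i in common) / len(common)

    # Numerator and both sums of squares run over the co-rated items only.
    num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    sxx = sum((ratings_u[i] - mean_u) ** 2 for i in common)
    syy = sum((ratings_v[i] - mean_v) ** 2 for i in common)

    denom = math.sqrt(sxx * syy)
    return num / denom if denom > 0 else None


# Hypothetical toy data: both users gave 5 to the two items they share,
# but their remaining ratings differ.
alice = {"i1": 5, "i2": 5, "i3": 1, "i4": 1}  # overall mean 3
bob = {"i1": 5, "i2": 5, "i5": 3, "i6": 3}    # overall mean 4

co_rated = pearson(alice, bob, use_overall_mean=False)
overall = pearson(alice, bob, use_overall_mean=True)
```

With the co-rated means, both users' deviations on `i1` and `i2` are zero, so the denominator vanishes and no similarity can be computed. With the overall means, the shared 5's sit above each user's own average and the two users come out as similar.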
