Alejandro,

The difficulty is that values that would normally be zero are in fact Double.NaN. Including these extra values to get a "correct" result invariably means ending up with Double.NaN as the result. To avoid this, Mahout uses non-standard implementations that only consider co-occurring entries in the computation. Whether these distance metrics should carry the same names as their non-recommender cousins is open to debate...
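A minimal standalone sketch (not Mahout's actual implementation) of the point: once missing ratings are represented as Double.NaN, any computation that touches them returns NaN, so the similarity code has to restrict itself to entries present in both vectors.

```java
// Sketch only: why NaN-valued "missing" entries force similarity code
// to restrict itself to co-occurring entries. Not Mahout's real code.
public class NanDotProduct {

    // Naive dot product over full vectors: a single NaN entry poisons
    // the whole result, because NaN * x = NaN and NaN + x = NaN.
    static double naiveDot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Co-occurrence-only variant: skip any entry that is missing (NaN)
    // in either vector, which is the approach described above.
    static double coOccurringDot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            if (!Double.isNaN(a[i]) && !Double.isNaN(b[i])) {
                sum += a[i] * b[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // The two users from the example below: -- becomes Double.NaN.
        double[] u1 = {4, 4, Double.NaN, 5};
        double[] u2 = {3, 3, 5, Double.NaN};
        System.out.println(naiveDot(u1, u2));       // NaN
        System.out.println(coOccurringDot(u1, u2)); // 24.0 (4*3 + 4*3)
    }
}
```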
Daniel

On Wed, Apr 6, 2011 at 12:00 PM, Alejandro Bellogin Kouki
<[email protected]> wrote:
> Hi,
>
> maybe I didn't express myself correctly... I'm talking about the calculation
> of the user's or item's mean (R_i in Sarwar's paper), which should be
> computed using ALL the items of that user/item, BUT in Mahout it is computed
> using only the items co-rated by both users/items.
>
> This causes strange effects. For instance, if we have two users with two
> items in common, and other unknown ratings:
>
>       i1  i2  i3  i4
>   u1   4   4  --   5
>   u2   3   3   5  --
>
> the current code in Mahout computes the mean of u1 as 4, and of u2 as 3,
> which leads to a 0 when it is used for centering the data, instead of 4.33
> and 3.67, respectively.
>
> I hope it is clearer now.
>
> Alejandro
>
> Sebastian Schelter wrote:
>>
>> IIRC Sarwar et al.'s "Item-Based Collaborative Filtering Recommendation
>> Algorithms" explicitly mentions using only the co-rated cases for Pearson
>> correlation.
>>
>> --sebastian
>>
>> On 06.04.2011 17:33, Sean Owen wrote:
>>>
>>> It's a good question.
>>>
>>> The Pearson correlation of two series does not change if the series
>>> means change. That is, subtracting the same value from all elements of
>>> one series (or scaling the values) doesn't change the correlation. In
>>> that sense, I would not say you must center the series to make either
>>> one's mean 0. It wouldn't make a difference, no matter what number you
>>> subtracted, even if it were the mean of all ratings by the user.
>>>
>>> The code you see in the project *does* center the data, because *if*
>>> the means are 0, then the computation result is the same as the cosine
>>> measure, and that seems nice. (There's also an uncentered cosine
>>> measure version.)
>>>
>>> What I think you're really getting at is: can't we expand the series
>>> to include all items that either one or the other user rated? Then the
>>> question is, what are the missing values you want to fill in?
>>> There's
>>> not a great answer to that, since any answer is artificial, but
>>> picking the user's mean rating is a decent choice. This is not quite
>>> the same as centering.
>>>
>>> You can do that in Mahout -- use AveragingPreferenceInferrer to do
>>> exactly this with these similarity metrics. It will slow things down,
>>> and anecdotally I don't think it's worth it, but it's certainly there.
>>>
>>> I don't think the normal version, without a PreferenceInferrer, is
>>> "wrong". It is just implementing the Pearson correlation on all the
>>> data available, and you have to add a setting to tell it to make up
>>> data.
>>>
>>> On Wed, Apr 6, 2011 at 3:13 PM, Alejandro Bellogin Kouki
>>> <[email protected]> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I've been using Mahout for many years now, mainly for my Master's
>>>> thesis, and now for my PhD thesis. That is why, first, I want to
>>>> congratulate you for the effort of releasing such a library as open
>>>> source.
>>>>
>>>> At this point, my main concern is recommendation, and, because of
>>>> that, I have been using the different recommenders, evaluators and
>>>> similarities implemented in the library. However, today, after
>>>> inspecting your code many times, I have found, IMHO, a relevant bug
>>>> with further implications.
>>>>
>>>> It is related to the computation of the similarity. Although it is
>>>> not the only similarity implemented, Pearson's correlation is one of
>>>> the most popular ones. This similarity requires normalising (or
>>>> "centering") the data using the user's mean, in order to be able to
>>>> distinguish a user who usually rates items with 5's from a user who
>>>> usually rates them with 3's, even though both rated a particular item
>>>> with a 5. The problem is that the users' means are being calculated
>>>> using ONLY the items in common between the two users, leading to
>>>> strange similarity computations (or worse, to no similarity at all!).
>>>> It is not difficult to find small examples showing this behaviour;
>>>> besides, seminal papers assume the overall mean rating is used [1, 2].
>>>>
>>>> Since I am new to this patch and bug/fix terminology, I would like to
>>>> know what is the best (or the only?) way of contributing this
>>>> finding. I have to say that I have already fixed the code (it affects
>>>> the AbstractSimilarity class, and therefore it would have an impact
>>>> on other similarity functions too).
>>>>
>>>> Best regards,
>>>> Alejandro
>>>>
>>>> [1] M. J. Pazzani: "A framework for collaborative, content-based and
>>>> demographic filtering". Artificial Intelligence Review 13,
>>>> pp. 393-408. 1999.
>>>> [2] C. Desrosiers, G. Karypis: "A comprehensive survey of
>>>> neighborhood-based recommendation methods". Recommender Systems
>>>> Handbook, chapter 4. 2010.
>>>>
>>>> --
>>>> Alejandro Bellogin Kouki
>>>> http://rincon.uam.es/dir?cw=435275268554687
>
> --
> Alejandro Bellogin Kouki
> http://rincon.uam.es/dir?cw=435275268554687
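The u1/u2 example from Alejandro's message can be checked numerically with a short standalone sketch (not Mahout code): the mean over only the co-rated items differs from the mean over all of a user's ratings, and centering with the co-rated mean zeroes out both rating vectors, which is exactly the "no similarity at all" case.

```java
// Sketch of the example in the thread: u1 rated i1=4, i2=4, i4=5 and
// u2 rated i1=3, i2=3, i3=5; only i1 and i2 are co-rated.
public class MeanCenteringExample {

    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    public static void main(String[] args) {
        double[] u1All = {4, 4, 5};      // all of u1's ratings
        double[] u2All = {3, 3, 5};      // all of u2's ratings
        double[] u1CoRated = {4, 4};     // u1 on the shared items i1, i2
        double[] u2CoRated = {3, 3};     // u2 on the shared items i1, i2

        System.out.println(mean(u1All));     // 4.333...
        System.out.println(mean(u2All));     // 3.666...
        // Means over co-rated items only: centering with these turns both
        // vectors into (0, 0), so Pearson becomes 0/0 = NaN.
        System.out.println(mean(u1CoRated)); // 4.0
        System.out.println(mean(u2CoRated)); // 3.0
    }
}
```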

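Sean's point upthread, that the Pearson correlation is unchanged when a constant is subtracted from (or added to) every element of one series, can also be verified with a small hypothetical helper (again, not Mahout's implementation):

```java
// Sketch: Pearson correlation is invariant to shifting one series by a
// constant, so the particular centering value does not matter once both
// series are defined over the same set of items.
public class PearsonShiftInvariance {

    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) {
            mx += x[i];
            my += y[i];
        }
        mx /= n;
        my /= n;
        double num = 0, dx = 0, dy = 0;
        for (int i = 0; i < n; i++) {
            num += (x[i] - mx) * (y[i] - my);
            dx += (x[i] - mx) * (x[i] - mx);
            dy += (y[i] - my) * (y[i] - my);
        }
        return num / Math.sqrt(dx * dy);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 5};
        double[] y = {2, 2, 4, 6};
        double[] xShifted = {101, 102, 103, 105}; // every element of x + 100
        System.out.println(pearson(x, y));
        System.out.println(pearson(xShifted, y)); // same value as above
    }
}
```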