Re: Bug in similarity computation

Sean Owen Wed, 06 Apr 2011 09:44:12 -0700

No, I don't think it has anything to do with NaN. Both the result and
the implementation are by design.


I really don't understand this talk of "non-standard" Pearson
correlation. On the contrary, the implementation is quite strictly a
Pearson correlation. The request seems to be to "fix" the computation
to, say, compute a Pearson correlation on series like (1,2) and
(3,6,1,2). This isn't even well-formed -- the series aren't of the
same size.
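
To make the point concrete, here is a minimal sketch (plain Python, not
Mahout's code) of a Pearson correlation restricted to the items both
users rated, which is one way to get two series of the same length. The
function name pearson_corated and the dict-of-ratings layout are my own
assumptions for illustration. The ratings are the u1/u2 example from the
quoted message below.

```python
import math

def pearson_corated(a, b):
    """Pearson correlation over the items present in both rating dicts.

    Returns None when the value is undefined: fewer than two shared
    items, or zero variance in either series over the shared items.
    (Hypothetical helper; what Mahout returns in that case is not
    covered here.)
    """
    shared = sorted(a.keys() & b.keys())
    n = len(shared)
    if n < 2:
        return None
    xs = [a[i] for i in shared]
    ys = [b[i] for i in shared]
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant series has no defined correlation
    return sxy / math.sqrt(sxx * syy)

# Example ratings from the quoted message; "--" entries are simply absent.
u1 = {"i1": 4, "i2": 4, "i4": 5}
u2 = {"i1": 3, "i2": 3, "i3": 5}

print(pearson_corated(u1, u2))  # None: both series are constant on i1, i2
```

Over the two shared items the series are (4,4) and (3,3), so both
variances are zero and the correlation is undefined for this pair. That
follows from the data that is there, not from how the mean is computed.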

It does make sense if you pad the series to be of equal size, and how
to pad them is a good question. But that is not a question of how the
Pearson correlation is defined or implemented; it is a question of how
the data fed into it is "defined".
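
As a rough sketch of what padding could look like (again plain Python,
not Mahout's code, and not claimed to match what
AveragingPreferenceInferrer actually does): fill each user's missing
items with that user's own mean over everything they rated, then run an
ordinary Pearson correlation over the union of items. The helper names
pad_with_own_mean and pearson are assumptions made up for this example.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length series.

    Returns None if either series has zero variance.
    """
    if len(xs) != len(ys):
        raise ValueError("series must be of the same size")
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

def pad_with_own_mean(ratings, items):
    """Series over `items`, using the user's own mean for missing ones.

    Hypothetical padding rule chosen for illustration only.
    """
    mean = sum(ratings.values()) / len(ratings)
    return [ratings.get(i, mean) for i in items]

# Example ratings from the quoted message below.
u1 = {"i1": 4, "i2": 4, "i4": 5}
u2 = {"i1": 3, "i2": 3, "i3": 5}

items = sorted(u1.keys() | u2.keys())      # i1, i2, i3, i4
r = pearson(pad_with_own_mean(u1, items),
            pad_with_own_mean(u2, items))
print(round(r, 4))  # 0.3333
```

The correlation function is unchanged here; only the series handed to it
differ. The padded values are made up, so the result is only as
meaningful as the padding rule.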

And I'm saying that padding is a valid variant, and one that is already
implemented.
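
On the centering point from my earlier message (quoted below), a small
numeric check may help. This is only a sketch with made-up numbers: it
computes Pearson from raw sums, then checks that shifting a series or
scaling it by a positive factor leaves the value unchanged, and that the
cosine of the mean-centered series gives the same number. The function
names are assumptions for this example.

```python
import math

def pearson_raw(xs, ys):
    """Pearson correlation from raw sums (no explicit centering step)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den

def cosine(xs, ys):
    """Uncentered cosine measure of two equal-length series."""
    dot = sum(x * y for x, y in zip(xs, ys))
    return dot / math.sqrt(sum(x * x for x in xs) * sum(y * y for y in ys))

def center(xs):
    """Subtract the series' own mean from every element."""
    m = sum(xs) / len(xs)
    return [x - m for x in xs]

# Made-up series with non-zero variance, purely for illustration.
xs = [1, 2, 3, 5]
ys = [2, 1, 4, 6]

base = pearson_raw(xs, ys)
shifted = pearson_raw([x - 10 for x in xs], ys)       # subtract a constant
scaled = pearson_raw([3 * x for x in xs], ys)         # positive scaling
centered_cosine = cosine(center(xs), center(ys))

print(math.isclose(base, shifted, abs_tol=1e-9))          # True
print(math.isclose(base, scaled, abs_tol=1e-9))           # True
print(math.isclose(base, centered_cosine, abs_tol=1e-9))  # True
```

Scaling by a negative factor would flip the sign, which is why the check
uses a positive one.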


On Wed, Apr 6, 2011 at 5:11 PM, Daniel McEnnis <[email protected]> wrote:
> Alejandro,
>
> The difficulty lies in that values that are normally zero are in fact
> Double.NaN.  Including these extra values to get a correct result
> means, invariably, ending up with Double.NaN as a result.  To avoid
> this, Mahout uses non-standard implementations that only consider
> co-occurrence entries in the result.  Whether these distance metrics
> should be called the same as their non-recommender cousins is a
> question for debate....
>
> Daniel.
>
> On Wed, Apr 6, 2011 at 12:00 PM, Alejandro Bellogin Kouki
> <[email protected]> wrote:
>> Hi,
>>
>> maybe I didn't express myself correctly... I'm talking about the calculation
>> of user's or item's mean (R_i in Sarwar's paper), which should be computed
>> using ALL the items of that user/item, BUT in Mahout it is computed using
>> only the items co-rated by both users/items.
>>
>> This causes strange effects, for instance, if we have two users with two
>> items in common, and other unknown ratings:
>>     i1  i2  i3  i4
>> u1  4   4   --  5
>> u2  3   3   5   --
>>
>> the current code in Mahout computes the mean of u1 as 4, and of u2 as 3,
>> which makes every centered value 0, instead of using the overall means
>> of about 4.33 and 3.67, resp.
>>
>> I hope it is clearer now.
>>
>> Alejandro
>>
>> Sebastian Schelter wrote:
>>>
>>> IIRC Sarwar et al.'s "Item-Based Collaborative Filtering Recommendation
>>> Algorithms" explicitly says to use only the co-rated cases for Pearson
>>> correlation.
>>>
>>> --sebastian
>>>
>>> On 06.04.2011 17:33, Sean Owen wrote:
>>>>
>>>> It's a good question.
>>>>
>>>> The Pearson correlation of two series does not change if the series
>>>> means change. That is, subtracting the same value from all elements of
>>>> one series (or scaling the values) doesn't change the correlation. In
>>>> that sense, I would not say you must center the series to make either
>>>> one's mean 0. It wouldn't make a difference, no matter what number you
>>>> subtracted, even if it were the mean of all ratings by the user.
>>>>
>>>> The code you see in the project *does* center the data, because *if*
>>>> the means are 0, then the computation result is the same as the cosine
>>>> measure, and that seems nice. (There's also an uncentered cosine
>>>> measure version.)
>>>>
>>>>
>>>> What I think you're really getting at is, can't we expand the series
>>>> to include all items that either one or the other user rated? Then the
>>>> question is, what are the missing values you want to fill in? There's
>>>> not a great answer to that, since any answer is artificial, but
>>>> picking the user's mean rating is a decent choice. This is not quite
>>>> the same as centering.
>>>>
>>>> You can do that in Mahout -- use AveragingPreferenceInferrer to do
>>>> exactly this with these similarity metrics. It will slow things down
>>>> and anecdotally I don't think it's worth it, but it's certainly there.
>>>>
>>>> I don't think the normal version, without a PreferenceInferrer, is
>>>> "wrong". It is just implementing the Pearson correlation on all data
>>>> available, and you have to add a setting to tell it to make up data.
>>>>
>>>>
>>>>
>>>> On Wed, Apr 6, 2011 at 3:13 PM, Alejandro Bellogin Kouki
>>>> <[email protected]>  wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I've been using Mahout for many years now, mainly for my Master's
>>>>> thesis, and now for my PhD thesis. That is why, first, I want to
>>>>> congratulate you for the effort of releasing such a library as
>>>>> open source.
>>>>>
>>>>> At this point, my main concern is recommendation, and, because of
>>>>> that, I have been using the different recommenders, evaluators and
>>>>> similarities implemented in the library. However, today, after
>>>>> inspecting your code many times, I have found what is, IMHO, a
>>>>> significant bug with further implications.
>>>>>
>>>>> It is related to the computation of the similarity. Although it is
>>>>> not the only similarity implemented, Pearson's correlation is one of
>>>>> the most popular. This similarity requires normalising (or
>>>>> "centering") the data using the user's mean, in order to distinguish
>>>>> a user who usually rates items with 5's from a user who usually
>>>>> rates them with 3's, even when both gave a particular item a 5. The
>>>>> problem is that the users' means are being calculated using ONLY the
>>>>> items the two users have in common, leading to strange similarity
>>>>> computations (or worse, to no similarity at all!). It is not
>>>>> difficult to find small examples showing this behaviour. Besides,
>>>>> seminal papers assume the overall mean rating is used [1, 2].
>>>>>
>>>>> Since I am a newbie to this patch and bug/fix terminology, I would
>>>>> like to know what is the best (or the only?) way of contributing
>>>>> this finding. I have to say that I have already fixed the code (it
>>>>> affects the AbstractSimilarity class, and therefore it would have an
>>>>> impact on other similarity functions too).
>>>>>
>>>>> Best regards,
>>>>> Alejandro
>>>>>
>>>>> [1] M. J. Pazzani: "A framework for collaborative, content-based and
>>>>> demographic filtering". Artificial Intelligence Review 13,
>>>>> pp. 393-408. 1999.
>>>>> [2] C. Desrosiers, G. Karypis: "A comprehensive survey of
>>>>> neighborhood-based recommendation methods". Recommender Systems
>>>>> Handbook, chapter 4. 2010.
>>>>>
>>>>> --
>>>>>  Alejandro Bellogin Kouki
>>>>>  http://rincon.uam.es/dir?cw=435275268554687
>>>>>
>>>>>
>>>
>>
>> --
>>  Alejandro Bellogin Kouki
>>  http://rincon.uam.es/dir?cw=435275268554687
>>
>>
>
