I agree with Sean that Mahout's current implementation is a Pearson correlation, since it only considers paired items (as you said, it does not make sense to correlate two series of different lengths). However, the problem is that, in recommendation, when this correlation is used as a similarity measure, the mean of each variable (i.e., user or item) is not strictly the mean of the observed values in the series being correlated; it also needs to take into account some extra values (those items not co-rated with the other user).

So perhaps this is only a notation problem, and this distance should not be considered equivalent to the one cited in the references already mentioned.

Alejandro

Sean Owen wrote:
No, I don't think it's anything to do with NaN. The result and
implementation are quite by design.

I really don't understand this talk of "non standard" Pearson
correlation. On the contrary, the implementation is quite strictly a
Pearson correlation. The request seems to be to "fix" the computation
to, say, compute a Pearson correlation on series like (1,2) and
(3,6,1,2). This isn't even well-formed -- the series aren't of the
same size.

The request makes sense if what you want is to pad the series to equal
size. That's a good question. But then it's not a question of how the
Pearson correlation is defined or implemented, but of how the data fed
into it is "defined".

And I'm saying it's a valid variant, one implemented already.


On Wed, Apr 6, 2011 at 5:11 PM, Daniel McEnnis <[email protected]> wrote:
Alejandro,

The difficulty lies in the fact that values which would normally be zero
are in fact Double.NaN. Including these extra values to get a correct
result invariably means ending up with Double.NaN as the result. To
avoid this, Mahout uses non-standard implementations that only consider
co-occurring entries. Whether these distance metrics should be called
the same as their non-recommender cousins is a question for debate....

Daniel.

On Wed, Apr 6, 2011 at 12:00 PM, Alejandro Bellogin Kouki
<[email protected]> wrote:
Hi,

maybe I didn't express myself correctly... I'm talking about the calculation
of the user's or item's mean (R_i in Sarwar's paper), which should be computed
using ALL the items of that user/item, BUT in Mahout it is computed using
only the items co-rated by both users/items.

This causes strange effects. For instance, suppose two users have two items
in common, plus one rating each that the other user does not have:

      i1  i2  i3  i4
u1     4   4  --   5
u2     3   3   5  --

The current code in Mahout computes the mean of u1 as 4 and the mean of u2
as 3 (using only the co-rated items i1 and i2), so every centered rating
becomes 0, instead of centering with the full means of about 4.33 and 3.67,
respectively.

I hope it is more clear now.

Alejandro

Sebastian Schelter wrote:
IIRC, Sarwar et al.'s "Item-Based Collaborative Filtering Recommendation
Algorithms" explicitly says to use only the co-rated cases for the Pearson
correlation.

--sebastian

On 06.04.2011 17:33, Sean Owen wrote:
It's a good question.

The Pearson correlation of two series does not change if the series
means change. That is, subtracting the same value from all elements of
one series (or scaling the values) doesn't change the correlation. In
that sense, I would not say you must center the series to make either
one's mean 0. It wouldn't make a difference, no matter what number you
subtracted, even if it were the mean of all ratings by the user.

The code you see in the project *does* center the data, because *if*
the means are 0, then the computation result is the same as the cosine
measure, and that seems nice. (There's also an uncentered cosine
measure version.)
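For reference, the two properties Sean mentions can be written out
explicitly (this is just the standard definition, nothing Mahout-specific):
for equal-length series x and y and any constants a and b,

    \[
    \operatorname{pearson}(x + a,\, y + b)
      = \operatorname{pearson}(x, y)
      = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}
             {\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}
      = \cos\bigl(x - \bar{x},\, y - \bar{y}\bigr),
    \]

so a constant shift of either series changes nothing, and once each series
is centered the formula is exactly the cosine between the two vectors.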


What I think you're really getting at is, can't we expand the series
to include all items that either one or the other user rated? Then the
question is, what are the missing values you want to fill in? There's
not a great answer to that, since any answer is artificial, but
picking the user's mean rating is a decent choice. This is not quite
the same as centering.

You can do that in Mahout -- use AveragingPreferenceInferrer to do
exactly this with these similarity metrics. It will slow things down
and anecdotally I don't think it's worth it, but it's certainly there.
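For anyone following along, the wiring Sean describes looks roughly like
this with the Taste classes (a sketch only; "ratings.csv" and the user IDs
1 and 2 are placeholders, and package locations may differ slightly across
Mahout versions):

    import java.io.File;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.similarity.AveragingPreferenceInferrer;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class PearsonWithInferrer {
        public static void main(String[] args) throws Exception {
            // "ratings.csv" stands in for a userID,itemID,rating file.
            DataModel model = new FileDataModel(new File("ratings.csv"));
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            // Without this line, only items rated by both users enter the computation.
            similarity.setPreferenceInferrer(new AveragingPreferenceInferrer(model));
            System.out.println(similarity.userSimilarity(1L, 2L));
        }
    }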

I don't think the normal version, without a PreferenceInferrer, is
"wrong". It is just implementing the Pearson correlation on all data
available, and you have to add a setting to tell it to make up data.



On Wed, Apr 6, 2011 at 3:13 PM, Alejandro Bellogin Kouki
<[email protected]>  wrote:
Hi all,

I've been using Mahout for many years now, mainly for my Master's thesis,
and now for my PhD thesis. That is why, first of all, I want to congratulate
you for the effort of releasing such a library as open source.

At this point, my main concern is recommendation, and, because of that, I
have been using the different recommenders, evaluators and similarities
implemented in the library. However, today, after inspecting your code many
times, I have found what is, IMHO, a relevant bug with further implications.

It is related to the computation of the similarity. Although it is not the
only similarity implemented, Pearson's correlation is one of the most
popular ones. This similarity requires normalising (or "centering") the
data using the user's mean, in order to be able to distinguish a user who
usually rates items with 5's from a user who usually rates them with 3's,
even though both rated a particular item with a 5. The problem is that the
users' means are being calculated using ONLY the items in common between
the two users, leading to strange similarity computations (or worse, to no
similarity at all!). It is not difficult to find small examples showing
this behaviour; besides, seminal papers assume the overall mean rating is
used [1, 2].

Since I am new to this patch and bug/fix terminology, I would like to know
what the best (or the only?) way of contributing this finding is. I have to
say that I have already fixed the code (it affects the AbstractSimilarity
class and, therefore, it would have an impact on other similarity functions
too).

Best regards,
Alejandro

[1] M. J. Pazzani: "A framework for collaborative, content-based and
demographic filtering". Artificial Intelligence Review 13, pp. 393-408, 1999.
[2] C. Desrosiers, G. Karypis: "A comprehensive survey of neighborhood-based
recommendation methods". Recommender Systems Handbook, chapter 4, 2010.


--
 Alejandro Bellogin Kouki
 http://rincon.uam.es/dir?cw=435275268554687
