Yes this is very interesting --

On Mon, Feb 22, 2010 at 2:16 PM, Tamas Jambor <[email protected]> wrote:
> If I take all the items with an average rating of less than 2.5, and
> calculate the probability that
> they will get the highest score for each user (i.e. ranked first), I get
> higher probability
> than for items that have an average rating of 3.5.

Yes, that doesn't seem right. There could be some bias in that you
can't recommend items you already have rated, which would tend to be
higher-ranked items, but I doubt that's the issue.

Try another similarity metric? You don't have to use Pearson, see below.


> I think the reason why it is biased is that in item-based recommendation
> most of the time you can find
> some kind of correlation between any given items. And even if it is negatively
> correlated, you take it into account towards the score.
> For example, if I take 4 items rated 1,1,5,5 by the user and their correlation
> with the target item is 1,1,0,0 respectively, I get 2 using
> your calculation and 1 using the standard one as follows:
>
> preference += theSimilarity * prefs.getValue(i);
> totalSimilarity += Math.abs(theSimilarity);
> score = preference / totalSimilarity;

What if the weights are 1,1,-1,-1? The estimate is then
(1 + 1 - 5 - 5) / 4 = -2, which is not a meaningful rating at all.
This is why I say this won't work.
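
To make the arithmetic concrete, here is a rough sketch of the two
calculations on your example. It assumes the "+1" variant simply uses
(similarity + 1) as the weight; the class and method names are made up
for illustration and are not the actual Mahout code.

```java
// Illustrative only: class and method names are hypothetical, not Mahout API.
final class WeightedEstimate {

    // The quoted "standard" form: sum(sim * pref) / sum(|sim|).
    static double standard(double[] sims, double[] prefs) {
        double preference = 0.0;
        double totalSimilarity = 0.0;
        for (int i = 0; i < sims.length; i++) {
            preference += sims[i] * prefs[i];
            totalSimilarity += Math.abs(sims[i]);
        }
        // No usable weight at all: the estimate is undefined.
        return totalSimilarity == 0.0 ? Double.NaN : preference / totalSimilarity;
    }

    // Assumed "+1" form: shift each similarity from [-1, 1] to [0, 2]
    // so that no weight is negative.
    static double shifted(double[] sims, double[] prefs) {
        double preference = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < sims.length; i++) {
            double weight = sims[i] + 1.0;
            preference += weight * prefs[i];
            totalWeight += weight;
        }
        return totalWeight == 0.0 ? Double.NaN : preference / totalWeight;
    }

    public static void main(String[] args) {
        double[] prefs = {1, 1, 5, 5};
        System.out.println(standard(new double[] {1, 1, 0, 0}, prefs));   // 1.0
        System.out.println(standard(new double[] {1, 1, -1, -1}, prefs)); // -2.0
        System.out.println(shifted(new double[] {1, 1, 0, 0}, prefs));    // 14/6, about 2.33
        System.out.println(shifted(new double[] {1, 1, -1, -1}, prefs));  // 1.0
    }
}
```

Under that assumption the shifted form stays inside the range of the
input ratings whatever the signs of the similarities, while the
standard form can leave it as soon as a weight is negative.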

While in general I could ask why 2 is necessarily the "wrong" answer
and 1 is "right" -- in the case of Pearson I agree that 1 is the right
answer. This isn't necessarily true for other similarity measures,
where 0 doesn't have to mean "no mutual information".

But perhaps I have overlooked another way to 'fix' the negative weight
issue that is also compatible with Pearson's characteristics?


In the world of users, I would argue that a similarity of 0, even when
it is a 0 from a Pearson correlation, means there is *some*
relationship between the two users -- they overlap in some items out
of the very many out there, which is a positive association. So,
factoring in uncorrelated users is, I would say, more valid than
ignoring them. That's one reason I actually like the effect of the
"+1" over "+0".

I think this is less true for items, as you say, since in many cases
(like yours I think) there are more users than items. You are more
likely to be able to compute some similarity between any two items, so
the existence of any similarity at all means less. The "+1" could distort
more than "+0" -- but again I am not sure what else to do as "+0"
leads to ill-defined results.


But for your purposes you can easily adjust the implementation if you
like. You could drop all non-positive similarities from consideration
for example. You could just use a different implementation if it works
better.
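
As a rough sketch of the "drop all non-positive similarities" option:
the class and method names below are hypothetical, not an existing
Mahout class, and returning NaN when nothing qualifies is just one
possible convention.

```java
// Illustrative only: a hypothetical estimator, not part of Mahout.
final class PositiveOnlyEstimate {

    // Weighted average over items whose similarity to the target is > 0.
    // Zero and negative similarities are ignored entirely.
    static double estimate(double[] sims, double[] prefs) {
        double preference = 0.0;
        double totalSimilarity = 0.0;
        for (int i = 0; i < sims.length; i++) {
            if (sims[i] > 0.0) {
                preference += sims[i] * prefs[i];
                totalSimilarity += sims[i];
            }
        }
        // No positively similar item: there is nothing to base an estimate on.
        return totalSimilarity == 0.0 ? Double.NaN : preference / totalSimilarity;
    }

    public static void main(String[] args) {
        double[] prefs = {1, 1, 5, 5};
        System.out.println(estimate(new double[] {1, 1, -1, -1}, prefs)); // 1.0
        System.out.println(estimate(new double[] {-1, -1, 0, 0}, prefs)); // NaN
    }
}
```

Since every remaining weight is positive, the result is always a
weighted average of the user's ratings and cannot leave their range.
The cost is that items with no positively similar rated item get no
estimate at all.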
