[ https://issues.apache.org/jira/browse/MAHOUT-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158051#comment-13158051 ]
Paulo Villegas commented on MAHOUT-898:
---------------------------------------
I think that your example is actually desired behaviour :-) Pearson correlation
measures linear dependency between two variables; if it's 0, the two variables
have no linear relationship, so that item shouldn't
influence your preference, and it does work that way. But if it has a negative
value, it means that there is a linear dependence with negative slope. That is,
my preferences for the item being estimated are negatively correlated with
those other items: when they have a high rating, mine for the new item should
be low. So, if the items are rated 3 & 4, giving a 1 (capping to the minimum)
is not totally unreasonable, though perhaps a bit extreme (having only items
with negative correlations shouldn't happen too often anyway, though I have
indeed seen it).
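
To make the 3 & 4 case concrete, here is a minimal sketch of the estimate with
the absolute-value normalization. It is not Mahout's code: the class name, the
method name, the 1 to 5 rating scale and the two similarity values (-0.5 and
-0.8) are assumptions chosen for illustration.

```java
// Illustrative sketch only; not the Mahout implementation.
class NegativeCorrelationSketch {
    // Assumed rating scale; the real bounds depend on the data model.
    static final double MIN_PREF = 1.0;
    static final double MAX_PREF = 5.0;

    // Weighted sum of preferences, normalized by the sum of |similarity|.
    static double estimateWithAbs(double[] prefs, double[] sims) {
        double weighted = 0.0;
        double norm = 0.0;
        for (int i = 0; i < prefs.length; i++) {
            weighted += sims[i] * prefs[i];
            norm += Math.abs(sims[i]);
        }
        if (norm == 0.0) {
            return Double.NaN; // no usable neighbours
        }
        double raw = weighted / norm;
        // Cap to the rating scale.
        return Math.max(MIN_PREF, Math.min(MAX_PREF, raw));
    }

    public static void main(String[] args) {
        double[] prefs = {3.0, 4.0};   // ratings of the two known items
        double[] sims = {-0.5, -0.8};  // both negatively correlated (assumed values)
        // Raw value is about -3.6, which is capped to the minimum.
        System.out.println(estimateWithAbs(prefs, sims)); // prints 1.0
    }
}
```

With only negatively correlated items the raw quotient is negative, so the cap
to the minimum is what produces the 1.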
Even though Pearson is the only metric producing negative values, it is not a
fringe case, since it is probably the most used metric for neighborhood CF (and
for good reason -- it tends to produce the best results and it costs much less
than rank-based metrics such as Spearman). Hence it is worth ensuring that it
behaves reasonably.
I saw the (1+similarity) variant when looking at previous versions; it comes
from issue MAHOUT-321. But the problem, when it comes to Pearson, is that it
lets items with a correlation of 0 influence the final result (and
they shouldn't, since they are uncorrelated with the item being estimated).
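
A small sketch of why the (1+similarity) weighting is a problem for Pearson.
The numbers are made up for illustration (one item with correlation 0 rated 1,
one item with correlation 0.8 rated 5), and the class and method names are
hypothetical, not taken from Mahout.

```java
// Illustrative sketch only; not the Mahout implementation.
class OnePlusSimilaritySketch {
    // Plain weighted average of prefs using the given weights.
    static double weightedAverage(double[] prefs, double[] weights) {
        double weighted = 0.0;
        double norm = 0.0;
        for (int i = 0; i < prefs.length; i++) {
            weighted += weights[i] * prefs[i];
            norm += weights[i];
        }
        return weighted / norm;
    }

    // Shift every similarity by 1, as in the (1+similarity) variant.
    static double[] shiftByOne(double[] sims) {
        double[] shifted = new double[sims.length];
        for (int i = 0; i < sims.length; i++) {
            shifted[i] = 1.0 + sims[i];
        }
        return shifted;
    }

    public static void main(String[] args) {
        double[] prefs = {1.0, 5.0};
        double[] sims = {0.0, 0.8}; // first item is uncorrelated (assumed values)
        // Raw similarities: the uncorrelated item has weight 0 and is ignored.
        System.out.println(weightedAverage(prefs, sims));
        // Shifted similarities: the uncorrelated item now has weight 1
        // and drags the estimate towards its rating.
        System.out.println(weightedAverage(prefs, shiftByOne(sims)));
    }
}
```

With the raw similarities the estimate is 5; with the shifted ones it drops to
roughly 3.57, purely because of an item that carries no information about the
one being estimated.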
The estimation would probably work better if ratings were mean-centered (i.e.
the mean is removed before getting into the preference estimation), which is
also a standard practice. I'm trying to do something along these lines, but in
the meantime I proposed the 'abs' solution to at least avoid bizarre outputs
(the current behaviour produces 'surprising' recommendations, and while some
serendipity is desirable in a recommender, it would be better to have a way of
controlling it).
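
For reference, one possible shape of a mean-centered estimate, as a sketch
only. It assumes centering on the user's own mean rating (centering on item
means is another option), a 1 to 5 scale, and the same made-up similarities as
in the 3 & 4 example; none of the names below exist in Mahout.

```java
// Illustrative sketch only; not the Mahout implementation.
class MeanCenteredSketch {
    // Assumed rating scale.
    static final double MIN_PREF = 1.0;
    static final double MAX_PREF = 5.0;

    // userMean + sum(sim * (pref - userMean)) / sum(|sim|), capped to the scale.
    static double estimate(double userMean, double[] prefs, double[] sims) {
        double weightedDeviation = 0.0;
        double norm = 0.0;
        for (int i = 0; i < prefs.length; i++) {
            weightedDeviation += sims[i] * (prefs[i] - userMean);
            norm += Math.abs(sims[i]);
        }
        if (norm == 0.0) {
            return userMean; // nothing to adjust with; fall back to the mean
        }
        double raw = userMean + weightedDeviation / norm;
        return Math.max(MIN_PREF, Math.min(MAX_PREF, raw));
    }

    public static void main(String[] args) {
        double userMean = 3.0;         // assumed mean rating of this user
        double[] prefs = {3.0, 4.0};
        double[] sims = {-0.5, -0.8};  // both negatively correlated (assumed values)
        System.out.println(estimate(userMean, prefs, sims));
    }
}
```

Here the negatively correlated items pull the estimate below the user's mean
(to roughly 2.38) instead of sending it straight to the minimum.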
> Error in formula for preference estimation in GenericItemBasedRecommender
> -------------------------------------------------------------------------
>
> Key: MAHOUT-898
> URL: https://issues.apache.org/jira/browse/MAHOUT-898
> Project: Mahout
> Issue Type: Bug
> Components: Collaborative Filtering
> Environment: mahout-core
> Reporter: Paulo Villegas
> Assignee: Sean Owen
> Priority: Minor
> Labels: patch
> Fix For: 0.6
>
> Attachments: GenericItemBasedRecommender.diff
>
>
> The formula to estimate the preference for an item in the Taste item-based
> recommender normalizes by the sum of similarities for items used in
> estimation. But the terms in the sum taken to normalize should be in absolute
> value, since they can be negative (e.g. when using Pearson correlation,
> similarity is in [-1,1]). Now they are not, and as a result when there are
> negative and positive values they cancel out, giving a small denominator and
> incorrectly boosting the preference for the item (symptom: it is easy for a
> predicted preference to take the maximum value, since the quotient becomes
> large and it is capped afterwards).
> The patch is rather trivial (a one-liner, actually) for
> src/main/java/org/apache/mahout/cf/taste/impl/recommender/GenericItemBasedRecommender.java
> Note: the same error occurs in GenericUserBasedRecommender, and the same fix
> applies there.
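
The cancellation described in the issue can be reproduced with a short sketch.
This is not the attached patch and not Mahout's code: the class and method
names, the 1 to 5 scale and the sample values (ratings 4 and 2 with
similarities 0.9 and -0.8) are assumptions for illustration.

```java
// Illustrative sketch only; not the Mahout implementation or the patch.
class DenominatorCancellationSketch {
    // Assumed rating scale.
    static final double MIN_PREF = 1.0;
    static final double MAX_PREF = 5.0;

    // absoluteNorm = false: normalize by sum(sim), the behaviour reported as a bug.
    // absoluteNorm = true:  normalize by sum(|sim|), the proposed fix.
    static double estimate(double[] prefs, double[] sims, boolean absoluteNorm) {
        double weighted = 0.0;
        double norm = 0.0;
        for (int i = 0; i < prefs.length; i++) {
            weighted += sims[i] * prefs[i];
            norm += absoluteNorm ? Math.abs(sims[i]) : sims[i];
        }
        if (norm == 0.0) {
            return Double.NaN; // similarities cancel exactly or are all zero
        }
        double raw = weighted / norm;
        return Math.max(MIN_PREF, Math.min(MAX_PREF, raw));
    }

    public static void main(String[] args) {
        double[] prefs = {4.0, 2.0};
        double[] sims = {0.9, -0.8}; // opposite signs, similar magnitude
        // Signed sum: denominator is about 0.1, quotient about 20, capped to the maximum.
        System.out.println(estimate(prefs, sims, false));
        // Absolute sum: denominator is 1.7, quotient stays near the low end of the scale.
        System.out.println(estimate(prefs, sims, true));
    }
}
```

With the signed denominator the estimate hits the maximum of the scale; with
the absolute-value denominator it comes out at roughly 1.18.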