[ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776986#action_12776986
 ] 

Sean Owen commented on MAHOUT-103:
----------------------------------

What's the problem in this example? Two people who have both seen all three 
Matrix films are probably similar, all the more so if they've both rated the 
first one highly and the other two poorly. You'd correctly identify them as 
similar with or without ratings here.

The issue, I suppose, comes up when you compare one of them with someone who 
didn't like the first one and liked the other two (strange, I know). Without 
pref values, we'd draw the same conclusion -- they have some similarity. With 
pref values, most metrics would say they are very dissimilar.

I actually think that's the wrong conclusion! The fact that two people bothered 
to watch all three says much more about their similarities than the variance in 
ratings says about their differences. I'd still guess they're sorta-similar, 
and metrics without pref values would tend to draw the more correct conclusion.
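
To make the two cases concrete, here is a rough standalone sketch. It is not 
Mahout code: the class name, the 1-5 rating scale and the ratings themselves 
are made up for illustration. It compares a Tanimoto (Jaccard) coefficient, 
which only looks at which films each person saw, with a Pearson correlation, 
which uses the pref values.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of Mahout or of the attached patch.
public class SimilaritySketch {

    // Tanimoto (Jaccard) coefficient over the sets of items each user has seen.
    // Rating values are ignored entirely.
    public static double tanimoto(Map<String, Double> a, Map<String, Double> b) {
        Set<String> both = new HashSet<>(a.keySet());
        both.retainAll(b.keySet());
        int union = a.size() + b.size() - both.size();
        return union == 0 ? 0.0 : (double) both.size() / union;
    }

    // Pearson correlation over the items both users rated.
    // Returns NaN when it is undefined (fewer than two common items,
    // or one user gave every common item the same rating).
    public static double pearson(Map<String, Double> a, Map<String, Double> b) {
        Set<String> both = new HashSet<>(a.keySet());
        both.retainAll(b.keySet());
        int n = both.size();
        if (n < 2) {
            return Double.NaN;
        }
        double meanA = 0.0;
        double meanB = 0.0;
        for (String item : both) {
            meanA += a.get(item);
            meanB += b.get(item);
        }
        meanA /= n;
        meanB /= n;
        double cov = 0.0;
        double varA = 0.0;
        double varB = 0.0;
        for (String item : both) {
            double da = a.get(item) - meanA;
            double db = b.get(item) - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        double denom = Math.sqrt(varA * varB);
        return denom == 0.0 ? Double.NaN : cov / denom;
    }

    public static void main(String[] args) {
        // Assumed ratings on a 1-5 scale: the first viewer liked only the
        // first film, the second viewer liked only the sequels.
        Map<String, Double> first = Map.of(
                "The Matrix", 5.0, "Reloaded", 1.0, "Revolutions", 1.0);
        Map<String, Double> second = Map.of(
                "The Matrix", 1.0, "Reloaded", 5.0, "Revolutions", 5.0);

        // Same three films seen, so the co-occurrence view gives 1.0.
        System.out.println("tanimoto = " + tanimoto(first, second));
        // Opposite rating pattern, so Pearson comes out at -1
        // (up to floating-point rounding).
        System.out.println("pearson  = " + pearson(first, second));
    }
}
```

With these made-up numbers the ratings-free measure calls the two viewers as 
similar as possible, while the ratings-based one calls them as dissimilar as 
possible, which is the disagreement described above.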


Of course there's no one right answer, and we can easily construct situations 
where throwing out pref values indeed hurts the result. I'm only asserting that 
it's entirely possible, in real data sets, for ratings to *hurt* on the whole. 


Let's start by adding the basic approach and then go on to look at 
variations. I at least have some global knowledge of how the framework is set 
up and could help design these variations in a way that's consistent with it.



> Co-occurence based nearest neighbourhood
> ----------------------------------------
>
>                 Key: MAHOUT-103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-103
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>            Reporter: Ankur
>            Assignee: Ankur
>         Attachments: jira-103.patch
>
>
> Nearest neighborhood type queries for users/items can be answered efficiently 
> and effectively by analyzing the co-occurrence model of a user/item w.r.t 
> another. This patch aims at providing an implementation for answering such 
> queries based upon simple co-occurrence counts.

