[ https://issues.apache.org/jira/browse/MAHOUT-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated MAHOUT-387:
-----------------------------

            Status: Resolved  (was: Patch Available)
          Assignee: Sean Owen
     Fix Version/s: 0.3
        Resolution: Won't Fix

Yes, as Jeff said, this already exists as PearsonCorrelationSimilarity. When the mean of each series is 0, the result is the same.

The fastest way I know to see this is to look at this form of the sample correlation:
http://upload.wikimedia.org/math/c/a/6/ca68fbe94060a2591924b380c9bc4e27.png

... and note that sum(xi) = sum(yi) = 0 when the means of xi and yi are 0. You're left with sum(xi*yi) in the numerator, which is the dot product, and sqrt(sum(xi^2)) * sqrt(sum(yi^2)) in the denominator, which is the product of the vector lengths. This is just the cosine of the angle between x and y.

One can argue whether forcing the data to be centered is right. I think it's a good thing in all cases. It adjusts for a user's tendency to rate high or low on average. It also makes the computation simpler, and more consistent with Pearson (well, it makes it identical!). This has a good treatment:
http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Geometric_interpretation

For this reason alone I'd mark this as won't-fix for the moment; the patch is otherwise nice. I'd personally like to hear more about why not to center, if there's an argument for it.

> Cosine item similarity implementation
> -------------------------------------
>
>                 Key: MAHOUT-387
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-387
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>            Reporter: Sebastian Schelter
>            Assignee: Sean Owen
>             Fix For: 0.3
>
>         Attachments: MAHOUT-387.patch
>
>
> I needed to compute the cosine similarity between two items when running
> org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob, but I couldn't find an
> implementation (did I overlook it, maybe?), so I created my own. I want to
> share it here, in case you find it useful.
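
The equivalence described in the comment above (Pearson correlation equals cosine similarity once both series have mean 0) can be checked numerically. What follows is a minimal sketch in plain Java, not Mahout's PearsonCorrelationSimilarity code. The class name CenteredCosineDemo, its methods, and the sample ratings are all hypothetical, and the pearson() method assumes the single-pass form of the sample correlation, which may or may not be the exact form shown in the linked image.

```java
public class CenteredCosineDemo {

    // Cosine of the angle between x and y: dot(x, y) / (|x| * |y|).
    public static double cosine(double[] x, double[] y) {
        double dot = 0.0;
        double sumXX = 0.0;
        double sumYY = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            sumXX += x[i] * x[i];
            sumYY += y[i] * y[i];
        }
        return dot / (Math.sqrt(sumXX) * Math.sqrt(sumYY));
    }

    // Sample Pearson correlation, single-pass form (assumed):
    // (n*sum(xy) - sum(x)*sum(y))
    //   / (sqrt(n*sum(x^2) - sum(x)^2) * sqrt(n*sum(y^2) - sum(y)^2))
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0.0;
        double sumY = 0.0;
        double sumXY = 0.0;
        double sumXX = 0.0;
        double sumYY = 0.0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumXX += x[i] * x[i];
            sumYY += y[i] * y[i];
        }
        double numerator = n * sumXY - sumX * sumY;
        double denominator = Math.sqrt(n * sumXX - sumX * sumX)
                * Math.sqrt(n * sumYY - sumY * sumY);
        return numerator / denominator;
    }

    // Returns a copy of v with its mean subtracted from every element.
    public static double[] center(double[] v) {
        double mean = 0.0;
        for (double value : v) {
            mean += value;
        }
        mean /= v.length;
        double[] centered = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            centered[i] = v[i] - mean;
        }
        return centered;
    }

    public static void main(String[] args) {
        // Hypothetical ratings of two items by the same four users.
        double[] x = {5.0, 3.0, 4.0, 1.0};
        double[] y = {4.0, 1.0, 5.0, 2.0};

        System.out.println("pearson(x, y)                = " + pearson(x, y));
        System.out.println("cosine(center(x), center(y)) = " + cosine(center(x), center(y)));
        System.out.println("cosine(x, y), uncentered     = " + cosine(x, y));
    }
}
```

The first two printed values should agree up to floating-point rounding, while the uncentered cosine generally differs. That difference is what the centering question in the comment above is about.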