I really just want to get the sample covariance, which is: sum((X_i - meanX) * (Y_i - meanY)) / (N-1)
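
To make that concrete, here is a rough sketch in Python (not the Mahout code). The function names and the toy data are made up for illustration; the centered sums use the same shortcut as the Pearson code quoted further down (sumXY - meanY * sumX, and so on).

```python
import math


def centered_sums(xs, ys):
    """Return (sumXY, sumX2, sumY2, n) for mean-centered data.

    Uses the shortcut sum((x - meanX) * (y - meanY)) = sum(x * y) - meanY * sum(x).
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    mean_x = sum_x / n
    mean_y = sum_y / n
    sum_xy = sum(x * y for x, y in zip(xs, ys)) - mean_y * sum_x
    sum_x2 = sum(x * x for x in xs) - mean_x * sum_x
    sum_y2 = sum(y * y for y in ys) - mean_y * sum_y
    return sum_xy, sum_x2, sum_y2, n


def sample_covariance(xs, ys):
    """Sample covariance: centered sumXY divided by (N - 1)."""
    sum_xy, _, _, n = centered_sums(xs, ys)
    return sum_xy / (n - 1)


def pearson(xs, ys):
    """Pearson correlation as centered sumXY / sqrt(sumX2 * sumY2)."""
    sum_xy, sum_x2, sum_y2, _ = centered_sums(xs, ys)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)


def sample_sd(xs):
    """Sample standard deviation with the (N - 1) denominator."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))


# Toy data, purely illustrative.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(sample_covariance(xs, ys))  # 1.5
```

On this toy data, pearson(xs, ys) * sample_sd(xs) * sample_sd(ys) comes out equal to the covariance up to floating-point rounding, as long as both standard deviations use the N-1 denominator.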
The sample covariance is just pearson_x,y * sdX * sdY, so I think sumXY/(N-1) should be the right one.

srowen wrote:
>
> I'm not so familiar with this formula but you seem to be missing
> something in the denominator... it's got to normalize somehow. I think
> I said divide by standard deviation but that's not quite it. What you
> are really summing are the products of z-scores. See
> http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
>
> But I think you should just use the formulation given in the code?
> should be the same result. At least I hope these aren't different
> definitions of Pearson!
>
> On Fri, Nov 27, 2009 at 10:20 AM, jamborta <[email protected]> wrote:
>>
>> thank you. much clearer now.
>>
>> so for my purpose this will do:
>>
>> sumXY/N-1
>>
>> given that the data is 'centered'?
>>
>>
>> On Fri, Nov 27, 2009 at 1:41 AM, jamborta <[email protected]> wrote:
>>>
>>> hi. I tried to figure out how you calculate pearson correlation, but it
>>> looks like you use this formula:
>>>
>>> sumXY / sqrt(sumX2 * sumY2)
>>
>> Yes that's right -- this is what Pearson reduces to when the mean of X
>> and Y are 0. And they are here -- the implementation 'centers' the
>> data.
>>
>>> where sumXY = sumXY - meanY * sumX;
>>> sumX2 = sumX2 - meanX * sumX;
>>> sumY2 = sumY2 - meanY * sumY;
>>
>> You see the lines commented out there? Those are the full forms of the
>> expressions, which may make more sense. This is centering the data,
>> making the mean 0.
>>
>> This is a simplification based on the observation that, for example,
>> sumX * meanY = sumY * meanX = n * meanY * meanX.
>>
>>>
>>> i don't really understand how you got these equations. could you explain
>>> it to me? I thought pearson correlation would be like this
>>>
>>> E(x_i-meanX)(y_i-meanY) / sdX*sdY
>>
>> That's right that's the expression for a population correlation, but
>> we can really only compute a sample Pearson correlation coefficient,
>> yes:
>>
>>
>>> for my project I would need to get sample correlation coefficient which
>>> would be something like this:
>>>
>>> sum(x_i-meanX)(y_i-meanY)/(N-1)
>>
>> Yeah that's fine too, this is another way of expressing the formula,
>> though you're missing the two standard deviations in the denominator.
>> It'll be clearer if I note that the mean of X and Y are 0.
>>
>>
>>
>> --
>> View this message in context:
>> http://old.nabble.com/Mahout-Taste-covariance-between-two-items-tp26530825p26540395.html
>> Sent from the Mahout User List mailing list archive at Nabble.com.
>>
>>
>
>

--
View this message in context: http://old.nabble.com/Mahout-Taste-covariance-between-two-items-tp26530825p26541591.html
Sent from the Mahout User List mailing list archive at Nabble.com.
