On Thu, Sep 26, 2013 at 6:42 PM, Nathaniel Smith <[email protected]> wrote:
> On 26 Sep 2013 21:59, "Faraz Mirzaei" <[email protected]> wrote:
>>
>> Thanks Josef and Nathaniel for your responses.
>>
>> In the application that I have, I don't use the correlation coefficient
>> matrix as a whole (so I don't care if it is PSD or not). I simply read the
>> off-diagonal elements for pairwise correlation coefficients. I use the
>> pairwise correlation coefficient to test whether the data from various
>> sources (i.e., rows of the matrix) agree with each other when present.
>>
>> Right now, I use ma.corrcoef(x[i, :], x[j, :]) and read the
>> off-diagonal element in a loop over i and j. It is just a bit uglier than
>> calling ma.corrcoef(x).
>>
>> At least for my application, truncation to -1 or +1 (or scaling such that
>> the largest value becomes 1, etc.) is completely wrong, since it would imply
>> that the two sources completely agree with each other (factoring out a minus
>> sign), which may not be the case. For example, consider the first and last
>> rows of the example I provided:
>>
>> >>> print x_ma
>> [[ 7 -4 -1 -- -3 -2]
>>  [ 6 -3  0  4  0  5]
>>  [-4 --  7  5 -- --]
>>  [--  5 --  0  1  4]]
>>
>> >>> np.ma.corrcoef(x_ma)[0,3]
>> -1.6813456149534147
>>
>>
>> On the other hand, if we supply only the first and last rows to the
>> function, we get:
>>
>> >>> np.ma.corrcoef(x_ma[0,:], x_ma[3,:])
>> masked_array(data =
>>  [[1.0 -0.240192230708]
>>  [-0.240192230708 1.0]],
>>              mask =
>>  [[False False]
>>  [False False]],
>>        fill_value = 1e+20)
>>
>> Interestingly, this is the same as what pandas reports as the [3,0]
>> element of the correlation coefficient matrix, and it is equal to the
>> pairwise deletion result:
>>
>> >>> np.corrcoef([-4, -3, -2], [5, 1, 4])  # Note that this is NOT np.ma.corrcoef
>> array([[ 1.        , -0.24019223],
>>        [-0.24019223,  1.        ]])
>>
>>
>> Also, I don't know why the ma.corrcoef results Josef has mentioned are
>> different from mine. In particular, Josef reports element [2, 0] of the
>> ma.corrcoef result to be -1.19, but I get -- (i.e., missing and masked,
>> probably due to too few samples available). Josef: are you sure that you
>> have entered the example values correctly into python? Along the same lines,
>> the results that Nathaniel has posted from R are different, since the input
>> is not a masked matrix I guess (please note that in the original example, I
>> had masked values less than or equal to -5).
>
> Yes, sorry, this is just a cut and paste error - in fact the result I posted
> is what R gives for the stay with values <= -5 replaced by NA, but I left
> this line out of the email.
>
> I think the only difference is that R and pandas give a correlation of 1.0
> when there are only 1 or 2 data points, and ma.corrcoef returns masked in
> this case. Not sure which makes more sense.
>
>>
>> In any case, I think the correlation coefficient between two rows of a
>> matrix should not depend on what other rows are supplied. In other words,
>> np.ma.corrcoef(x_ma)[0,3] should be equal to np.ma.corrcoef(x_ma[0,:],
>> x_ma[3,:])[0,1] (which apparently happens to be what pandas reports).
>>
>> This change would require recomputing the mean for every pairwise
>> coefficient calculation, but since we are computing cross products O(n^2)
>> times, the overall big-O complexity won't change.
>>
>> And please don't remove this functionality. I will volunteer to fix it
>> however we decide :) We can just clarify the behavior in the documentation.
>
> In the long run I prefer R's behaviour of requiring the user to specify
> before skipping anything, but I tend to agree that in the short term
> pairwise deletion is what ma.corrcoef users expect and what we should do.
> Maybe you could implement the fix and we could move the discussion to the
> PR?
pandas has a Cython function in algos that loops over all pairs and
calculates the mean, cross product and standard deviation for each pair
separately.

I agree that that would be the best choice for pairwise deletion for
np.ma.corrcoef, and cov.

Josef

>
> -n
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> [email protected]
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
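
A minimal sketch of the pairwise-deletion loop discussed in this thread,
written in plain Python rather than Cython or NumPy. It is neither the pandas
implementation nor a proposed patch for np.ma.corrcoef. The function name
pairwise_corrcoef, the min_obs parameter, and the use of None to stand in for
masked entries are all assumptions made for illustration. For each pair of
rows it keeps only the positions present in both, then recomputes the means,
cross product and sums of squares for that pair alone.

```python
import math


def pairwise_corrcoef(rows, min_obs=2):
    """Correlation matrix with pairwise deletion (illustrative sketch).

    rows    : list of equal-length sequences; None marks a missing value
              (a stand-in for a masked entry, an assumption of this sketch).
    min_obs : hypothetical parameter; pairs with fewer common observations
              are left as None in the result.
    """
    n = len(rows)
    result = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Keep only positions where both rows have a value.
            pairs = [(a, b) for a, b in zip(rows[i], rows[j])
                     if a is not None and b is not None]
            if len(pairs) < min_obs:
                continue
            # Statistics are recomputed for this pair alone, so the result
            # does not depend on which other rows were supplied.
            mean_a = sum(a for a, _ in pairs) / len(pairs)
            mean_b = sum(b for _, b in pairs) / len(pairs)
            cross = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
            ss_a = sum((a - mean_a) ** 2 for a, _ in pairs)
            ss_b = sum((b - mean_b) ** 2 for _, b in pairs)
            if ss_a == 0 or ss_b == 0:
                continue  # correlation undefined for a constant series
            r = cross / math.sqrt(ss_a * ss_b)
            result[i][j] = result[j][i] = r
    return result


# The example matrix from this thread, with None in place of masked values.
x = [[7, -4, -1, None, -3, -2],
     [6, -3, 0, 4, 0, 5],
     [-4, None, 7, 5, None, None],
     [None, 5, None, 0, 1, 4]]

corr = pairwise_corrcoef(x)
print(corr[0][3])
```

For the example matrix, corr[0][3] is computed from the three positions shared
by the first and last rows, i.e. [-4, -3, -2] against [5, 1, 4], the same
inputs as the np.corrcoef call quoted above. With the default min_obs=2, a pair
with exactly two common points yields +1 or -1. Raising min_obs leaves such
pairs as None, which is closer to the masking behaviour reported for
ma.corrcoef in this thread.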
