On Thu, Sep 26, 2013 at 7:35 AM, Nathaniel Smith <[email protected]> wrote: > On Thu, Sep 26, 2013 at 11:51 AM, <[email protected]> wrote: >> On Thu, Sep 26, 2013 at 4:21 AM, Nathaniel Smith <[email protected]> wrote: >>> If you want a proper self-consistent correlation/covariance matrix, then >>> pairwise deletion just makes no sense period, I don't see how postprocessing >>> can help. >> >> clipping to [-1, 1] and finding the nearest positive semi-definite matrix. >> For the latter there is some code in statsmodels, and several newer >> algorithms that I haven't looked at. > > While in general there's an interesting problem for how to best > estimate a joint covariance matrix in the presence of missing data, > this sounds way beyond the scope of lowly corrcoef(). The very fact > that there is active research on the problem means that we shouldn't > be trying to pick one algorithm and declare that it's *the* correct > one to build in as a basic tool for unsophisticated users. > > It's not clear that this is even what people want corrcoef to do in > general. I always use it as just a convenient way to estimate lots of > pairwise correlations, not as a way to estimate a joint correlation > matrix...
I don't expect that corrcoef or cov should provide a positive semidefinite approximation. I just wanted to point out what users can do with the return from corrcoef or cov if they want a proper correlation or covariance matrix. (same discussion with pandas, post-processing is a separate step.) > >> It's a quite common problem in finance, but usually longer time series >> with not a large number of missing values. >> >>> >>> If you want a matrix of correlations, then pairwise deletion makes sense. >>> It's an interesting point that arguably the current ma.corrcoef code may >>> actually give you a better estimator of the individual correlation >>> coefficients than doing full pairwise deletion, but it's pretty surprising >>> and unexpected... when people call corrcoef I think they are asking "please >>> compute the textbook formula for 'sample correlation'" not "please give me >>> some arbitrary good estimator for the population correlation", so we >>> probably have to change it. >>> >>> (Hopefully no-one has published anything based on the current code.) >> >> I haven't seen a textbook version of this yet. > > By textbook I mean, users expect corrcoef to use this formula, which > is printed in every textbook: > > https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#For_a_sample > The vast majority of people using correlations think that "sample > correlation" justs mean this number, not "some arbitrary finite-sample > estimator of the underlying population correlation". So the obvious > interpretation of pairwise correlations is that you apply that formula > to each set of pairwise complete observations. This textbook version **assumes** that we have the same observations for all/both variables, and doesn't say what to do if not. I'm usually mainly interested the covariance/correlation matrix for estimating some underlying population or model parameters or do hypothesis tests with them. I just wanted to point out that there is no "obvious" ("There should be one-- ...") way to define pairwise deletion correlation matrices. But maybe just doing a loop [corrcoef(x, y) for x in data for y in data] still makes the most sense. Dunno > >> Calculating every mean (n + 1) * n / 2 times sounds a bit excessive, >> especially if it doesn't really solve the problem. > > I don't understand what's "excessive" about calculating every > mean/stddev (n + 1)*n/2 times. By that logic one should never call > corrcoef at all, because calculating (n + 1)*n/2 covariances is also > excessive :-). It's not like we're talking about some massive > slowdown. > >>>> > looks like pandas might be truncating the correlations to [-1, 1] (I >>>> > didn't check) >>>> > >>>> >>>> import pandas as pd >>>> >>>> x_pd = pd.DataFrame(x_ma.T) >>>> >>>> x_pd.corr() >>>> > 0 1 2 3 >>>> > 0 1.000000 0.734367 -1.000000 -0.240192 >>>> > 1 0.734367 1.000000 -0.856565 -0.378777 >>>> > 2 -1.000000 -0.856565 1.000000 NaN >>>> > 3 -0.240192 -0.378777 NaN 1.000000 >>>> > >>>> >>>> np.round(np.ma.corrcoef(x_ma), 6) >>>> > masked_array(data = >>>> > [[1.0 0.734367 -1.190909 -1.681346] >>>> > [0.734367 1.0 -0.856565 -0.378777] >>>> > [-1.190909 -0.856565 1.0 --] >>>> > [-1.681346 -0.378777 -- 1.0]], > > That can't be truncation -- where corrcoef has -1.68, pandas has > -0.24, not -1.0. my mistake, for not reading carefully enough Josef > > R gives the same result as pandas (except that by default it > propagates NA and give a range of options for handling NAs, so you > have to explicitly request pairwise complete if you want it, and they > make this the most annoying option to type ;-), and the help page > explicitly warns that this "can result in covariance or correlation > matrices which are not positive semi-definite"): > >> x <- c(7, -4, -1, -7, -3, -2, 6, -3, 0, 4, 0, 5, -4, -5, 7, 5, -7, -7, -5, >> 5, -8, 0, 1, 4) >> x <- matrix(x, nrow=4, byrow=T) >> cor(t(x), use="pairwise.complete.obs") > [,1] [,2] [,3] [,4] > [1,] 1.0000000 0.43330535 -0.22749669 -0.5246782 > [2,] 0.4333053 1.00000000 -0.01829124 -0.2120425 > [3,] -0.2274967 -0.01829124 1.00000000 -0.6423032 > [4,] -0.5246782 -0.21204248 -0.64230323 1.0000000 > > -n > _______________________________________________ > NumPy-Discussion mailing list > [email protected] > http://mail.scipy.org/mailman/listinfo/numpy-discussion _______________________________________________ NumPy-Discussion mailing list [email protected] http://mail.scipy.org/mailman/listinfo/numpy-discussion
