On Thu, Jan 26, 2012 at 6:43 PM, <[email protected]> wrote:
> On Thu, Jan 26, 2012 at 3:58 PM, Bruce Southey <[email protected]> wrote:
>> On Thu, Jan 26, 2012 at 12:45 PM, <[email protected]> wrote:
>>> On Thu, Jan 26, 2012 at 1:25 PM, Bruce Southey <[email protected]> wrote:
>>>> On Thu, Jan 26, 2012 at 10:07 AM, Pierre Haessig
>>>> <[email protected]> wrote:
>>>>> On 26/01/2012 15:57, Bruce Southey wrote:
>>>>>> Can you please provide a
>>>>>> couple of real examples with expected output that clearly show what
>>>>>> you want?
>>>>>>
>>>>> Hi Bruce,
>>>>>
>>>>> Thanks for your ticket feedback! It's precisely because I see a big
>>>>> potential impact of the proposed change that I sent first a ML message,
>>>>> then a ticket, before jumping to a pull request like a Sergio Leone
>>>>> cowboy (sorry, I watched "For a Few Dollars More" last weekend...)
>>>>>
>>>>> Now, I realize that in the ticket writing I made the wrong trade-off
>>>>> between conciseness and accuracy, which led to some of the errors you
>>>>> raised. Let me use your example to share what I have in mind.
>>>>>
>>>>>> >> X = array([-2.1, -1. , 4.3])
>>>>>> >> Y = array([ 3. , 1.1 , 0.12])
>>>>>
>>>>> Indeed, with today's cov behavior we have a 2x2 array:
>>>>>> >> cov(X,Y)
>>>>> array([[ 11.71      ,  -4.286     ],
>>>>>        [ -4.286     ,   2.14413333]])
>>>>>
>>>>> Now, when I used the word 'concatenation', I wasn't precise enough,
>>>>> because I meant assembling X and Y in the sense of 2 vectors of
>>>>> observations from 2 random variables X and Y.
>>>>> This is achieved by concatenate(X,Y) *when properly playing with
>>>>> dimensions* (which I didn't mention):
>>>>>> >> XY = np.concatenate((X[None, :], Y[None, :]))
>>>>> array([[-2.1 , -1.  ,  4.3 ],
>>>>>        [ 3.  ,  1.1 ,  0.12]])
>>>>
>>>> In this context, I find stacking, np.vstack((X,Y)), more appropriate
>>>> than concatenate.
>>>>
>>>>>
>>>>> In this case, I can indeed say that "cov(X,Y) is equivalent to cov(XY)".
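[The equivalence claimed above is easy to check directly. A minimal sketch using the values from the example; only standard NumPy calls are used:]

```python
import numpy as np

# The two 1-d observation vectors from the example above.
X = np.array([-2.1, -1.0, 4.3])
Y = np.array([3.0, 1.1, 0.12])

# Today's behavior: cov(X, Y) treats X and Y as two rows of variables
# and returns the full 2x2 variance-covariance matrix.
C1 = np.cov(X, Y)

# Stacking the vectors into a (2, 3) array first gives the same result.
C2 = np.cov(np.vstack((X, Y)))

print(C1)
print(np.allclose(C1, C2))  # True
```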
>>>>>> >> np.cov(XY)
>>>>> array([[ 11.71      ,  -4.286     ],
>>>>>        [ -4.286     ,   2.14413333]])
>>>>>
>>>> Sure, the resulting array is the same, but the whole process is totally
>>>> different.
>>>>
>>>>> (And indeed, the actual cov Python code does use concatenate().)
>>>> Yes, but the user does not see that, whereas you are forcing the user
>>>> to do the stacking in the correct dimensions.
>>>>
>>>>>
>>>>> Now let me come back to my assertion about this behavior's *usefulness*.
>>>>> You'll acknowledge that np.cov(XY) is made of four blocks (here just 4
>>>>> simple scalar blocks).
>>>> No, there are not '4' blocks, just rows and columns.
>>>
>>> Sturla showed the 4 blocks in his first message.
>>>
>> Well, I could not follow that because the code is wrong:
>> X = np.array([-2.1, -1. , 4.3])
>>>>> cX = X - X.mean(axis=0)[np.newaxis,:]
>>
>> Traceback (most recent call last):
>>   File "<pyshell#6>", line 1, in <module>
>>     cX = X - X.mean(axis=0)[np.newaxis,:]
>> IndexError: 0-d arrays can only use a single () or a list of newaxes
>> (and a single ...) as an index
>>
>> whoops!
>>
>> Anyhow, the variance-covariance matrix is symmetric, but numpy and scipy
>> lack LAPACK's symmetric matrix storage
>> (http://www.netlib.org/lapack/explore-html/de/d9e/group___s_y.html).
>>
>>>>
>>>>> * diagonal blocks are just cov(X) and cov(Y) (which in this case comes
>>>>> to var(X) and var(Y) when setting ddof to 1)
>>>> Sure, but variances are still covariances.
>>>>
>>>>> * off-diagonal blocks are symmetric and are actually the covariance
>>>>> estimate of the X, Y observations (from
>>>>> http://en.wikipedia.org/wiki/Covariance)
>>>> Sure
>>>>>
>>>>> that is:
>>>>>> >> ((X-X.mean()) * (Y-Y.mean())).sum() / (3-1)
>>>>> -4.2860000000000005
>>>>>
>>>>> The new proposed behaviour for cov is that cov(X,Y) would return
>>>>> array(-4.2860000000000005) instead of the 2*2 matrix.
>>>>
>>>> But how do you interpret a 2D array with more than 2 rows?
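[The traceback quoted above comes from indexing a 0-d result: on a 1-d array, X.mean(axis=0) is a scalar, so [np.newaxis, :] fails. A small sketch of a centering step that works, assuming the np.cov convention of one variable per row:]

```python
import numpy as np

X = np.array([-2.1, -1.0, 4.3])

# Promote X to 2-d first (one row per variable, as np.cov expects);
# then the per-row mean can be broadcast back without an IndexError.
X2 = np.atleast_2d(X)                      # shape (1, 3)
cX = X2 - X2.mean(axis=1)[:, np.newaxis]   # row-wise centered: -2.5, -1.4, 3.9

print(cX)
```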
>>>>>>> Z = Y + X
>>>>>>> np.cov(np.vstack((X,Y,Z)))
>>>> array([[ 11.71      ,  -4.286     ,   7.424     ],
>>>>        [ -4.286     ,   2.14413333,  -2.14186667],
>>>>        [  7.424     ,  -2.14186667,   5.28213333]])
>>>>
>>>>>
>>>>> * This would be in line with the cov(X,Y) mathematical definition, as
>>>>> well as with R's behavior.
>>>> I don't care what R does because I am using Python, and Python is
>>>> infinitely better than R!
>>>>
>>>> But I think that is only in the 1D case.
>>>
>>> I just checked R to make sure I remember correctly:
>>>
>>>> xx = matrix((1:20)^2, nrow=4)
>>>> xx
>>>      [,1] [,2] [,3] [,4] [,5]
>>> [1,]    1   25   81  169  289
>>> [2,]    4   36  100  196  324
>>> [3,]    9   49  121  225  361
>>> [4,]   16   64  144  256  400
>>>> cov(xx, 2*xx[,1:2])
>>>          [,1]      [,2]
>>> [1,]  86.0000  219.3333
>>> [2,] 219.3333  566.0000
>>> [3,] 352.6667  912.6667
>>> [4,] 486.0000 1259.3333
>>> [5,] 619.3333 1606.0000
>>>> cov(xx)
>>>          [,1]     [,2]      [,3]      [,4]      [,5]
>>> [1,]  43.0000 109.6667  176.3333  243.0000  309.6667
>>> [2,] 109.6667 283.0000  456.3333  629.6667  803.0000
>>> [3,] 176.3333 456.3333  736.3333 1016.3333 1296.3333
>>> [4,] 243.0000 629.6667 1016.3333 1403.0000 1789.6667
>>> [5,] 309.6667 803.0000 1296.3333 1789.6667 2283.0000
>>>
>>>>
>>>>> * This would save memory and computing resources. (and therefore help
>>>>> save the planet ;-) )
>>>> Nothing that you have provided shows that it will.
>>>
>>> I don't know about saving the planet, but if X and Y have the same
>>> number of columns, we save 3 quarters of the calculations, as Sturla
>>> also explained in his first message.
>>>
>> I cannot figure out those savings:
>> a 2 by 2 output has 3 distinct covariances (so 3/4 = 0.75 is 'needed', not 25%),
>> a 3 by 3 output has 6 covariances,
>> a 5 by 5 output has 15 covariances.
>
> What numpy calculates are 4, 9 and 25 covariances; we might care only
> about 1, 2 and 4 of them.
>
>>
>> If you want to save memory and calculation, then use symmetric storage
>> and associated methods.
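[The R behavior quoted above (cov(xx, 2*xx[,1:2]) returning only the 5x2 cross-covariance block) can be reproduced in NumPy. cross_cov below is a hypothetical helper, not an existing NumPy function; it follows R's column-wise convention (observations in rows, variables in columns):]

```python
import numpy as np

def cross_cov(x, y, ddof=1):
    # Hypothetical helper mirroring R's cov(x, y): return only the
    # (k_x, k_y) cross-covariance block, variables in columns.
    x = np.atleast_2d(x)
    y = np.atleast_2d(y)
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    return xc.T @ yc / (x.shape[0] - ddof)

# The R example above: matrix((1:20)^2, nrow=4), i.e. a 4x5 matrix of
# squares filled column-major.
xx = (np.arange(1, 21).reshape(5, 4).T) ** 2.0

print(cross_cov(xx, 2 * xx[:, :2]))  # 5x2 block, matching the R output
```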
>
> Actually, for the covariance matrix we still need to subtract the means, so we
> won't save 75% overall, but we do save 75% in the cross-product.
>
> Suppose X and Y are (nobs, k_x) and (nobs, k_y) (means already subtracted)
> (and ignoring that numpy "likes" rows instead of columns).
>
> The partitioned dot product [X, Y]'[X, Y] is
>
> [[X'X, X'Y],
>  [Y'X, Y'Y]]
>
> X'Y is (k_x, k_y), and the total shape is (k_x + k_y, k_x + k_y).
>
> If we are only interested in X'Y, we don't need the other three submatrices.
>
> If k_x = 99 and k_y is 1, we save .... ?
> (we get a (99, 1) instead of a (100, 100) matrix)
>
> And X'Y, np.dot(X.T, Y), doesn't have any duplicated symmetry, so
> exploiting symmetry is a different issue.
>
> Josef
>
>>
>> Bruce
>>
>>> Josef
>>>
>>>>
>>>>>
>>>>> However, I do understand that the impact of this change may be big.
>>>>> This indeed requires careful reviewing.
>>>>>
>>>>> Pierre
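[The partitioned product described above can be sketched numerically. A minimal demonstration with made-up random data, using the k_x = 99, k_y = 1 case from the message:]

```python
import numpy as np

rng = np.random.default_rng(0)
nobs, k_x, k_y = 1000, 99, 1

# Mean-centered data, as assumed in the message above.
X = rng.standard_normal((nobs, k_x))
X -= X.mean(axis=0)
Y = rng.standard_normal((nobs, k_y))
Y -= Y.mean(axis=0)

# Full partitioned product [X, Y]'[X, Y]: a (100, 100) matrix ...
XY = np.hstack((X, Y))
full = XY.T @ XY

# ... of which the cross block X'Y is only the (99, 1) upper-right corner.
cross = X.T @ Y

print(full.shape, cross.shape)
print(np.allclose(full[:k_x, k_x:], cross))  # True
```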
Thanks to someone for clearly stating what they want. But this still lacks
evidence that it will save the world: when nobs is large, k_x and k_y are
meaningless, and thus (99, 1) vs (100, 100) is also meaningless. Further,
dealing separately with the two arrays also brings additional overhead -
small, but not zero.

Bruce
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
