After some offline discussion with Fraser Jackson and some more investigation, I've come up with an altered proposal that is simpler, more coherent, and more powerful.

This proposal retains a single definition each of cov and corr, but they now support both the current dyadic behaviour and the monadic, multivariate, matrix behaviour from the last proposal.

   XtY=: +/ .*~ |:           NB. monadic is XtX
   spdev=: XtY&dev           NB. replaces: +/@(*~ dev)
   cov=: spdev % <:@#@]      NB. no change to current definition
   corr=: cov % */~&stddev   NB. replaces: cov % *&stddev

   X=: 1 1 1 1 2 2 2 2 3 3 3 3
   Y=: 1 2 2 3 5 5 6 7 10 11 11 12
   Z=: 1 1 2 2 4 6 5 4 8 7 9 10
   W=: X,.Y,.Z

   X cov Y
3.27273
   cov X,.Y
0.727273 3.27273
 3.27273 15.4773
   cov W
0.727273 3.27273 2.54545
 3.27273 15.4773 11.6591
 2.54545 11.6591  9.7197

   X corr Y
0.97547
   corr X,.Y
      1 0.97547
0.97547       1
   corr W
       1  0.97547 0.957393
 0.97547        1 0.950585
0.957393 0.950585        1

A slightly more performant version of correlation is possible. It doesn't work dyadically, but it is also numerically more precise.

   diag=: (<0 1)&|:
   cov2corr=: % */~@:%:@diag
   corrm=: cov2corr@cov

   V=: 100000 100 ?@$ 0

The two versions don't give exactly the same answer, but the largest absolute difference is on the order of 1e_14:

   >./ ,|(corr - corrm) V
2.42029e_14

They differ because the following 2 definitions of sum of squares do not give exactly the same value:

   (diag@(+/ .*~ |:) - +/@:*:) dev 100 5 ?.@$ 0
_1.77636e_15 1.77636e_15 5.32907e_15 _1.77636e_15 0

Time and space for corr relative to corrm:

   (10 timespacex 'corr V') % (10 timespacex 'corrm V')
1.19102 1.00049

If the numerical precision and the small speed and space improvements are of interest, then corrm could be included in addition to the suggested changes. (The speed difference is smaller prior to the AVX improvements in 806.)

On Thu, Jul 13, 2017 at 12:42 PM, Ric Sherlock <tikk...@gmail.com> wrote:

> The current verbs for calculating covariance and correlation in the
> stats/base/multivariate.ijs script, are dyadic and designed to calculate
> the cov/corr between 2 variables
> e.g.
>    load 'stats'
>    X=: 1 1 1 1 2 2 2 2 3 3 3 3
>    Y=: 1 2 2 3 5 5 6 7 10 11 11 12
>    Z=: 1 1 2 2 4 6 5 4 8 7 9 10
>    X cov Y
> 3.27273
>    X corr Y
> 0.97547
>
> Often we want to calculate a cov/corr matrix for more than 2 variables.
> The current definitions can be used for this purpose
>
>    cov"1/~ X,Y,:Z
> 0.727273 3.27273 2.54545
>  3.27273 15.4773 11.6591
>  2.54545 11.6591  9.7197
>    corr"1/~ X,Y,:Z
>        1  0.97547 0.957393
>  0.97547        1 0.950585
> 0.957393 0.950585        1
>
> but they are slower than these alternatives
>    ((+/ .*~ |:)@dev % <:@#) X,.Y,.Z
>    ((+/ .*~ |:)@(dev %"_1 _ stddev) % <:@#) X,.Y,.Z
>
> This topic has come up in the forums at least a couple of times.
> http://www.jsoftware.com/pipermail/programming/2011-March/022417.html
> http://www.jsoftware.com/pipermail/programming/2009-September/016321.html
> http://www.jsoftware.com/pipermail/programming/2007-June/007186.html
>
> I propose to add the following definitions to the
> stats/base/multivariate.ijs script. Any algorithmic or naming suggestions
> are welcome.
>
> XtX=: |: +/ .* ]
> cov_multi=: XtX@dev % <:@#
> corr_multi=: XtX@(dev %"_1 _ stddev) % <:@#
>
> Note that my testing suggests the fork (|: +/ .* ]) appears to be slightly
> faster and leaner than the equivalent hook (+/ .*~ |:)

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm