Re: [Jprogramming] Welford's method for standard deviations

Mike Day Fri, 16 Sep 2011 04:24:59 -0700

Apologies for misspelling Viktor's name.
Yes - I've had a look at the more scalar approach.


FWIW,  The best explicit form I've come up with is :

kMS =: {:@:(kMS2/)@:,&0  NB. appending zero sets k0=M0=S0=0

kMS2 =: 3 : 0
:
k   =. >:  km1 =. <.{.y          NB. k, k-1
dx  =. x - 1{y,0                 NB. x - Mk-1
y + 1, (1, km1) * (dx ^ 1 2) % k NB. do it!
)

A tacit expression which does the same job as
kMS2 is

kMSt  =: {:@:(kMS2t/)@:,&0

kMSt2=:13 : 'y + 1, (x - 1{y,0) ((1 2^~[) * (1&,%>:)@]) <.{.y'

Performance isn't great: x is ?1000#0 -

    timer'kMS 1e9+ 1000000$x'
+-------+-------+
|15.8811|82981.7|
+-------+-------+
    timer'kMSt 1e9+ 1000000$x'  NB. tacit version
+-------+-------+
|7.61304|82981.7|
+-------+-------+
    timer'Sk   1e9+ 1000000$x'  NB. my earlier contribution
+-------+-------+
|5.50684|82981.7|
+-------+-------+
    timer'ssdev 1e9+ 1000000$x' NB. statfns.ijs
+--------+-------+
|0.171143|82981.7|
+--------+-------+

   Stability is not ideal but less compromised for kMS
than for ssdev for a larger array:
    timer'ssdev 10000000$x' NB. statfns.ijs
+-------+------+
|1.49707|829817|
+-------+------+
    timer'kMSt 10000000$x'  NB. tacit version
+---------+------+
|1 15.1301|829817|
+---------+------+
NB. ok when mean is ~0

    timer'ssdev 1e9+ 10000000$x' NB. statfns.ijs
+-------+------+
|1.60718|859034|
+-------+------+

    timer'Sk   1e9+ 10000000$x'  NB. my earlier contribution
|out of memory: makexM
|       Sk 1000000000+10000000$x

    timer'kMSt 1e9+ 10000000$x'  NB. tacit version
+--------+------+
|1 15.052|829821|
+--------+------+
NB. less good when mean ~ 1e9

I hadn't heard of Welford when I reinvented this wheel
a few years ago, for different reasons:  if you only
know the mean, standard deviation and sample size for
two (or more) different samples,  I used an analogous
technique to estimate the mean & std were those samples
to be combined,  assuming of course that their
underlying distributions were the same.

But if you do have all the data,  I still think that
the simplest approach is satisfactory.  If it isn't,
in a real statistical investigation, I would look for
outliers,  try transforming the data (eg taking logs),
and generally wonder why it's unstable.

Mike

On 15/09/2011 5:33 PM, Viktor Cerovski wrote:
>
> Mike Day-3 wrote:
>> Here's a somewhat J-ish way to do the job.  It's ok for small data
>> sets but falls down on a few million cases.
>>
>> NB. given x,  append a reversed column of k, then append a row: 0 0
>> NB. If you like, this reverts to origin zero, setting M0 = 0
>> NB. x gets reversed on the way,  but I don't think that usually matters
>> NB. If it does matter,  just do Mk on |.x
>>
>> makexM =: 0 ,~ ] ,.~ [:>: [: i. [: - #
>>
>> NB. carry out Mk-1 + (xk - Mk-1)/k on (xk, xk-1,. Mk, Mk-1)
>> NB. and prepend xk to produce xk, Mk
>>
>>      mk =: ({:@] , {:@[ - -&{: % {.@])
>>
>> NB. Put these verbs together and clean up the resulting 2 column
>> NB. array of (reversed) x and Mk on reversed x:
>>
>>      Mk =:  }.@: |.@: (mk ~/\.) @: makexM
>>
>> NB. diffs xk - Mk
>>
>>      difxk_Mk =: -/"1
>>
>> NB. diffs xk - Mk-1
>>
>>      difxk_Mkm1 =: }:@:(-~//.)
>>
>> NB. combine these for Sk:
>>
>>      Sk =: +/@: (difxk_Mk * difxk_Mkm1) @: Mk
>>
>> NB. eg given x5
>>      x5
>>
>> 0.971379 0.0466292 0.330109 0.01346 0.220004
>>
>>      Mk x5
>>
>>    0.220004 0.220004
>>     0.01346 0.116732
>>    0.330109 0.187858
>> 0.0466292 0.152551
>>    0.971379 0.316316
>>
>>      Sk x5
>>
>> 0.603026
>>
>>      ssdev x5  NB. cf stats.ijs verb
>>
>> 0.603026
>>
> Looks interesting to me, I'll study it more when I get the chance.
>
>
>
>> It's a bit quicker than Victor's Wstd.
>>
>> However, Sk ran out of space on his
>>      a=:1e9+2e7?@$0;
>> It does run with
>>      a=:1e9+5e6?@$0,
>> but is then much slower than ssdev (about 60
>> times!).  On my machine, Wstd is another 4
>> times slower than that.   All 3 verbs gave
>> the same sum of squares for this value of a .
>>
> Wstd can be sped up by simply rewriting it
> as a for. loop, getting a standard imperative-style
> program.  The speedup should be considerable since
> there will be no boxing/unboxing being done for
> every element of x, and because of the simpler iteration.
[snipped the rest - Mike]

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jprogramming] Welford's method for standard deviations

Reply via email to