Well, to answer my own question: the following shows how to calculate the
standard deviation directly from a frequency table. It greatly speeds up my
calculation, so the binomial method suggested by John Randall now has only
about a 60-to-1 advantage.
NB. Include my old "Adjusted Mean and Standard Deviation" function for comparison:
adjmsdOld=: (([:mean 1 2 3 4 5+/ .*~]),[:stddev 1 2 3 4 5#~[:<.0.5+1e6*])
NB.* freqmean: calculate mean of elements x. with frequencies y.
freqmean=: 4 : 'x. +/ . * y.'
NB.* freqsd: calculate standard deviation of elements x. with frequencies y.
freqsd=: 4 : '%:(y. +/ . * *:x.)-*:x. freqmean y.'
NB. So, new version of "adjmsd":
adjmsd=: (1 2 3 4 5(freqmean,.freqsd)])
NB. Re-run timings done previously:
6!:2 'mnssds=. adjmsd"1]100{.CPROBS'
0.0048782736
6!:2 'adjustNormalDist"1 mnssds'
0.23120061
NB. Ratio of times for generating standard normal equivalent vs. binomial approximation:
(+/0.23120061 0.0048782736)%0.0038555179
61.231432
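
As a cross-check outside J, here is a minimal Python sketch of the same idea:
take the mean and standard deviation straight from the frequency table using
sqrt(E[x^2] - E[x]^2), instead of expanding the table into a million
replicated scores first. The function names and the example frequencies are
made up for illustration, and the frequencies are assumed to sum to 1.

```python
import math

def freq_mean(values, freqs):
    """Mean of `values` weighted by relative frequencies `freqs` (assumed to sum to 1)."""
    return sum(v * f for v, f in zip(values, freqs))

def freq_sd(values, freqs):
    """Population standard deviation from a frequency table: sqrt(E[x^2] - E[x]^2)."""
    mean = freq_mean(values, freqs)
    mean_sq = sum(v * v * f for v, f in zip(values, freqs))
    # Guard against a tiny negative difference caused by floating-point rounding.
    return math.sqrt(max(mean_sq - mean * mean, 0.0))

scores = [1, 2, 3, 4, 5]
freqs = [0.1, 0.2, 0.4, 0.2, 0.1]  # hypothetical distribution, not from CPROBS
print(freq_mean(scores, freqs), freq_sd(scores, freqs))
```

The work is proportional to the number of distinct scores (five here) rather
than to the size of the expanded sample, which is consistent with the large
drop in the adjmsd timing above.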
On 1/16/07, Devon McCormick <[EMAIL PROTECTED]> wrote:
John -
this looks like a good alternative. It appears to slightly under-approximate
the variance but is so much faster (>20,000 x) that it may be worth the
trade-off. Now if only my Bayesian method would give better results!
Here's my comparison:
NB. Different ways to approximate normal distribution (on 5 scores), having
NB. (approximately) same mean and standard deviation as empirical data.
NB. Following suggested by John Randall: fit binomial distribution:
meanfreq =:(+/ .* [EMAIL PROTECTED]) % +/ NB.* meanfreq: mean of frequency table
binn =:<:@# NB.* binn: binomial n from frequency table
binp =:meanfreq % <:@# NB.* binp: binomial p from frequency table
NB.* bindist: binomial dist from freq table
bindist =:[: p. (binp^binn);(binn # [EMAIL PROTECTED])
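
For readers who want to check the binomial fit without J, here is a small
Python sketch based on John Randall's description quoted further down: for a
frequency table on 0..n with mean m, use the binomial distribution with
p = m/n. The function name binomial_fit is hypothetical, and the pmf is
computed directly with math.comb rather than through polynomial coefficients
as in the J definition.

```python
from math import comb

def binomial_fit(freqs):
    """Binomial distribution on 0..n with the same mean as the frequency table."""
    total = sum(freqs)
    n = len(freqs) - 1                                   # binomial n
    mean = sum(k * f for k, f in enumerate(freqs)) / total
    p = mean / n                                         # binomial p, from n*p = mean
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

d = [0, 0.13333333, 0.4, 0.46666667, 0]  # example data from John Randall's message
b = binomial_fit(d)
print(b)
```

Only the mean is matched; the variance is then fixed at n*p*(1-p), which is
why the standard deviations in the binomial column of the comparison differ
from the empirical ones.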
NB. Versus an iterative estimator:
adjustNormalDist=: 3 : 0
mstarg=. y. [ maxiter=. 20 [ ctr=. 0
'madj sadj'=. mstarg
msrs=. adjmsd sn=. (%+/)(madj+sadj*i:2j4) pdfnc mstarg
while. (1e_5 +./ . <:|msrs-mstarg) *. maxiter>:ctr=. >:ctr do.
'madj sadj'=. (madj,sadj)+msrs-mstarg
msrs=. adjmsd sn=. (%+/)(madj+sadj*i:2j4) pdfnc mstarg
end.
sn
)
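
The loop above can be sketched in Python as follows. This is a reading of the
J code rather than a verified port: it samples the target normal density at
five points centered on an adjusted mean and spaced by an adjusted SD,
normalizes, measures the mean and SD of the result on scores 1..5, and nudges
the adjustments by the error, for at most 20 passes. All names are
illustrative, and the moments are taken directly from the probabilities
rather than from replicated data.

```python
import math

SCORES = [1, 2, 3, 4, 5]
OFFSETS = [-2, -1, 0, 1, 2]  # the five sample offsets, as in J's i:2j4

def normal_pdf(x, mean, sd):
    """Density of the normal curve with the given mean and SD at point x."""
    return math.exp(-((x - mean) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

def moments(probs):
    """Mean and SD of scores 1..5 under the probabilities `probs`."""
    mean = sum(s * p for s, p in zip(SCORES, probs))
    var = sum(s * s * p for s, p in zip(SCORES, probs)) - mean * mean
    return mean, math.sqrt(max(var, 0.0))

def sample_normal(m_adj, s_adj, target):
    """Normalized target density sampled at m_adj + s_adj * OFFSETS."""
    mean, sd = target
    dens = [normal_pdf(m_adj + s_adj * k, mean, sd) for k in OFFSETS]
    total = sum(dens)
    return [v / total for v in dens]

def adjust_normal_dist(target, max_iter=20, tol=1e-5):
    """Iteratively adjust the sample points until the moments match `target`."""
    m_adj, s_adj = target
    probs = sample_normal(m_adj, s_adj, target)
    got = moments(probs)
    for _ in range(max_iter):
        err_m, err_s = got[0] - target[0], got[1] - target[1]
        if max(abs(err_m), abs(err_s)) < tol:
            break
        m_adj += err_m
        s_adj += err_s
        probs = sample_normal(m_adj, s_adj, target)
        got = moments(probs)
    return probs

target = (3.4185304, 0.83555495)  # first mean/SD pair from the comparison in this message
probs = adjust_normal_dist(target)
print(probs, moments(probs))
```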
NB.* adjmsd: show adjusted mean and SD
adjmsd=: (([:mean 1 2 3 4 5+/ .*~]),[:stddev 1 2 3 4 5#~[:<.0.5+1e6*])
NB.* pdfnc: prob density fnc of normal curve at points x. for mean and SD in y.
pdfnc=: 4 : '(%sd*%:o. 2)*^-(*:x.-mn)%+:*:sd [ ''mn sd''=. y.'
crecs=. 1{"1 getUserRecs&>5{.UUIDS NB. Cust recs: movie, cust, rate, dt
mnssds=. (mean,stddev)&>2{&.>crecs NB. adjustNormalDist takes means & SDs
mnssds;(adjmsd"1 bindist"1]5{.CPROBS);adjmsd"1 adjustNormalDist"1 mnssds
+--------------------+--------------------+--------------------+
|3.4185304 0.83555495|3.4185304 0.97786064|3.4185272 0.83555661|
|4.0113507 0.89942311|4.0113507 0.86272412|4.0113546 0.89942198|
|4.2142857 0.78975397|4.2142857 0.79459397|4.2142815 0.78975647|
|3.3923077 1.0471226 |3.3923077 0.98057379|3.39231 1.0471132 |
|3.4814815 1.2206672 |3.4814815 0.97058952|3.4814534 1.2206439 |
+--------------------+--------------------+--------------------+
NB. Comparison of distribution approximations:
bindist"1]5{.CPROBS
0.024434501 0.14947004 0.34287521 0.34957109 0.13364915
0.0037318916 0.045468236 0.2077392 0.42183858 0.32122209
0.0014887392 0.024361187 0.1494891 0.40769756 0.41696341
0.026095869 0.15532661 0.34669791 0.34393318 0.12794643
0.020770187 0.1357661 0.33279251 0.36255445 0.14811676
adjustNormalDist"1 mnssds
0.0082269678 0.11550491 0.41299218 0.37606586 0.087210083
0.0058771044 0.051522575 0.20903631 0.39249664 0.34106737
0.0012439144 0.02226285 0.15342208 0.40711008 0.41596107
0.038271035 0.16036229 0.32599857 0.32152189 0.15384622
0.07274651 0.15352443 0.24164998 0.28368735 0.24839172
NB. Timings: bindist requires only the probability distribution:
6!:2 'bindist"1]100{.CPROBS'
0.0038555179
NB. Versus:
6!:2 'crecs=. 1{"1 getUserRecs&>100{.UUIDS'
9.9790119
6!:2 'adjustNormalDist"1 (mean,stddev)&>2{&.>crecs'
77.649548
(+/77.649548 9.9790119)%0.0038555179
22728.091 NB. Ratio of times
NB. Alternate use of "adjustNormalDist" works with estimates of the prob dists:
6!:2 'mnssds=. adjmsd"1]100{.CPROBS'
5.5266708
6!:2 'adjustNormalDist"1 mnssds'
77.952477
(+/77.952477 5.5266708)%0.0038555179
21651.864
NB. Little difference in relative times
On 1/16/07, John Randall <[EMAIL PROTECTED]> wrote:
>
> Devon McCormick wrote:
> > What I'd like to do is to construct an equivalent 5-element distribution
> > with the same mean and standard deviation but (more or less) normally
> > distributed.
>
> How about fitting a binomial distribution to the data? If a frequency
> table on i.(n+1) has mean m, the binomial distribution with
> generating function (q+px)^n has the same mean if np=m.
>
> mf =:(+/ .* [EMAIL PROTECTED]) % +/ NB. mean of frequency table
> n =:<:@# NB. binomial n from frequency table
> p =:mf % <:@# NB. binomial p from frequency table
> b =:[: p. (p^n);(n # [EMAIL PROTECTED]) NB. binomial dist from frequency table
>
> d =:0 0.13333333 0.4 0.46666667 0 NB. data
> b d NB. binomial with same mean
> 0.0301408 0.168789 0.354456 0.330826 0.115789
> mf d
> 2.33333
> mf b d
> 2.33333
>
> Best wishes,
>
> John
>
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>
--
Devon McCormick
^me^ at acm.org is my preferred e-mail