Brian -
thanks for your response. I guess I wasn't clear about what
I'm looking for: an approximately normal distribution
that has the same mean and standard deviation as the
actual distribution, which will, in general, be non-normal.
In the example I gave, the empirical distribution is:
0 0.13333333 0.4 0.46666667 0
What I'm looking for is an approximately normal version of this
distribution, for instance something like:
0.0031445589 0.10158411 0.49394342 0.36150472 0.039823194
which I calculated with an iterative approximation method.
Both of these have approximately the same mean and standard deviation
but the empirical one is probably non-normal.
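To check that claim directly from a probability vector, without replicating values a million times, here is a minimal sketch. The names vals, pmean and psd are made up for this illustration (they are not from frtab or adjmsd), and the definitions are tacit, so no x./y. argument names are involved:

```j
   vals=: >:i.5                                  NB. the ratings 1 2 3 4 5
   pmean=: (+/ . *)&vals                         NB. hypothetical name: sum of p*x
   psd=: [: %: (+/ . *)&(*:vals) - [: *: pmean   NB. hypothetical name: %: of E[x^2] - mean^2
   (pmean,psd) 0.5 0 0 0 0.5
3 2
   (pmean,psd) 0 0.13333333 0.4 0.46666667 0
   (pmean,psd) 0.0031445589 0.10158411 0.49394342 0.36150472 0.039823194
```

Note that psd is the population standard deviation. For the empirical vector it comes out near 0.699, whereas stddev on the 15 raw ratings divides by n-1 and gives 0.72374686; the approximately normal vector comes out near 0.7238, so it matches the sample figure rather than the population one.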
A better example may be one with made-up data:
   ]actd=. r2p 25 25#1 5 NB. Actual distribution
0.5 0 0 0 0.5
   ]apnord=. adjustNormalDist (mean,stddev) 25 25#1 5 NB. Approx normal
0.054488685 0.24420134 0.40261995 0.24420134 0.054488685
   adjmsd&>actd;apnord
3 2.000001
3 0.96141298
These have the same mean, but the standard deviations cannot be made equal
(so my approximation runs for some maximum number of iterations before it
gives up).
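One way to see why a standard deviation of 2 is out of reach: with the mean held at 3, widening the underlying normal only flattens the five probabilities toward uniform, and the uniform distribution on 1..5 has a standard deviation of %: 2, about 1.4142. A small sketch, where s and w are illustrative names and the trial values in s are arbitrary:

```j
   vals=: >:i.5
   s=: 0.5 1 2 5 1000                         NB. trial SDs of the underlying normal (arbitrary)
   w=: (%+/)"1 ^ - (+: *: s) %~/ *: vals-3    NB. one normalized 5-point row per trial SD
   %: w +/ . * *: vals-3                      NB. SD of each row (every row has mean 3 by symmetry)
```

The second entry (underlying SD of 1) should reproduce the 0.9614 above, and the results climb toward 1.4142 as s grows without ever getting near 2.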
This approximation method may be good enough for my purposes.
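Since adjustNormalDist itself isn't shown in this message, here is a rough stand-in for the same idea: a brute-force grid search over the center and SD of the underlying normal, keeping whichever pair gives a five-point distribution closest to the target mean and standard deviation. This is only a sketch under my own assumptions. The names npr, ms, target, p, mu, sd, err and best are all hypothetical, and the grid bounds and 0.01 step are arbitrary choices.

```j
   vals=: >:i.5
   target=: 3.3333333 0.72374686            NB. (mean,stddev) of the 15 ratings
   NB. npr (hypothetical): y is (center,sd); result is the normalized 5-point vector
   npr=: [: (%+/) [: ^ [: - *:@(vals - {.) % +:@*:@{:
   ms=: > , { (2.5+0.01*i.151) ; 0.3+0.01*i.171   NB. every (center,sd) candidate
   p=: npr"1 ms                             NB. one distribution per candidate
   mu=: p +/ . * vals                       NB. mean of each
   sd=: %: (p +/ . * *: vals) - *: mu       NB. population SD of each
   err=: +/"1 *: target -"1 mu ,. sd        NB. squared distance from the target
   best=: ms {~ (i. <./) err                NB. closest (center,sd)
   best , (mu ,. sd) {~ (i. <./) err        NB. followed by the mean and SD achieved
```

For these ratings the winner should land roughly at a center of 3.335 and an SD of 0.727, which is what the _0.461 shift and 1.376 scale in the quoted message work out to (3+0.461%1.376 and %1.376). A finer grid, or a proper root-finder, would tighten the match.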
"freqcount" may be a nice shortcut for the more general
"frtab" function I showed, which also handles non-numeric
arguments.
On 1/15/07, Brian Schott <[EMAIL PROTECTED]> wrote:
Devon,
I don't know why you are doing all this, but let me
observe that your final answer is simply the mean and
standard deviation of your original sample.
...
Btw, another way to get that distribution is as
follows.
freqcount=: (/: {:"1)@(~. ,.~ #/.~)
freqcount 2 4 4 3 3 3 4 2 3 4 4 3 4 3 4
2 2
6 3
7 4
On Mon, 15 Jan 2007, Devon McCormick wrote:
+ Members of the Forum -
+
+ let's say I'm looking at Netflix ratings as probability distributions
+ instead of as point estimates. So, if customer 692 has these ratings:
+
+ 'rc cr692'=. getUserRecs 692
+ 2{cr692 NB. The ratings
+ 2 4 4 3 3 3 4 2 3 4 4 3 4 3 4
+
+ The frequency table looks like this:
+
+ frtab 2{cr692 NB. counts ,. values
+ 2 2
+ 6 3
+ 7 4
+
+ [ where
+ frtab=: 3 : 0
+ y.=. y./:y.
+ difs=. 2-~/\(#y.),~I. 1,2~:/\y.
+ if. -.isNum y. do. difs=. <"0 difs [ y.=. <"1 ,.y. end.
+ difs,.~.y.
+ )
+ ]
+
+ Converting these to probabilities for ratings of >:i.5:
+
+ ]pr692=. ([:(%+/)0{1 0-~[:|:[:frtab 1 2 3 4 5,]) 2{cr692
+ 0 0.13333333 0.4 0.46666667 0
+
+ [I concatenate 1 2 3 4 5 to ensure an entry for any missing rating
+ and (%+/) to make the probabilities sum to one.]
+
+ Thus, customer 692 has given a rating of "3" 40% of the time and
+ a rating of "4" about 47% of the time.
+
+ This distribution has a mean and standard deviation:
+ (mean,stddev) 2{cr692
+ 3.3333333 0.72374686
+
+ Alternate mean calculation from the probability vector:
+ pr692 +/ . * >:i.5
+ 3.3333333
+
+ What I'd like to do is to construct an equivalent 5-element distribution
+ with the same mean and standard deviation but (more or less) normally
+ distributed.
+
+ I can easily do this for a standard normal (mean 0 and SD 1):
+
+ NB.* pdfnc: probability density fnc for normal curve w/given mean and SD.
+ pdfnc=: 4 : '(%sd*%:o. 2)*^-(*:x.-mn)%+:*:sd [ ''mn sd''=. y.'
+
+ Assuming the end-points are two standard deviations
+ ((i:2j4) -: _2 _1 0 1 2) from the mean:
+
+ ]sn=. (%+/)(i:2j4) pdfnc 0 1
+ 0.054488685 0.24420134 0.40261995 0.24420134 0.054488685
+
+ This distribution "sn" has mean of 3
+ sn +/ . * >:i.5
+ 3
+
+ and an approximate standard deviation of
+ stddev (<.0.5+1e6*sn)#>:i.5
+ 0.96141298
+
+ This is slightly less than one because I adjusted the distribution
+ using (%+/) to force summation to one. I'm sure there's a more exact,
+ analytic way to calculate the standard deviation but this works well
+ enough for now and I'm mostly concerned with the mean.
+
+ I can see an iterative way to get where I want:
+
+ NB. First, adjust the mean:
+ (sn=. (%+/)(_0.3+i:2j4) pdfnc 0 1)+/ . * >:i.5
+ 3.2758083
+ (sn=. (%+/)(_0.35+i:2j4) pdfnc 0 1)+/ . * >:i.5
+ 3.3211532
+ (sn=. (%+/)(_0.37+i:2j4) pdfnc 0 1)+/ . * >:i.5
+ 3.3392133
+ . . .
+ (sn=. (%+/)(_0.3635+i:2j4) pdfnc 0 1)+/ . * >:i.5
+ 3.3333489
+
+ NB. Now work on the standard deviation:
+ stddev (<.0.5+1e7*sn=. (%+/)(_0.3635+1.1*i:2j4) pdfnc 0 1)#>:i.5
+ 0.884115
+ stddev (<.0.5+1e7*sn=. (%+/)(_0.3635+1.2*i:2j4) pdfnc 0 1)#>:i.5
+ 0.82181792
+ stddev (<.0.5+1e7*sn=. (%+/)(_0.3635+1.5*i:2j4) pdfnc 0 1)#>:i.5
+ 0.66592555
+ stddev (<.0.5+1e7*sn=. (%+/)(_0.3635+1.4*i:2j4) pdfnc 0 1)#>:i.5
+ 0.71245182
+
+ NB. Of course, this throws off the mean:
+ sn +/ . * >:i.5
+ 3.2584499
+ adjmsd=: (([:mean 1 2 3 4 5+/ .*~]),[:stddev 1 2 3 4 5#~[:<.0.5+1e7*])
+ NB. Combine target measures...
+ . . .
+ adjmsd sn=. (%+/)(_0.461+1.376*i:2j4) pdfnc 0 1
+ 3.3331422 0.7238559
+ NB. Not too bad compared to:
+ (mean,stddev) 2{cr692
+ 3.3333333 0.72374686
+
+ This is probably workable but there must be an analytic solution,
+ probably a fairly straightforward one.
+
+ Any ideas?
+
+ --
+ Devon McCormick
+ ^me^ at acm.
+ org is my
+ preferred e-mail
+ ----------------------------------------------------------------------
+ For information about J forums see http://www.jsoftware.com/forums.htm
+
(B=) <----------my "sig"
Brian Schott
Atlanta, GA, USA
schott DOT bee are eye eh en AT gee em ae eye el DOT com
http://schott.selfip.net/~brian/
--
Devon McCormick
^me^ at acm.
org is my
preferred e-mail