Members of the Forum -
let's say I'm looking at Netflix ratings as probability distributions
instead of as point estimates. So, if customer 692 has these ratings:
'rc cr692'=. getUserRecs 692
2{cr692 NB. The ratings
2 4 4 3 3 3 4 2 3 4 4 3 4 3 4
The frequency table looks like this:
frtab 2{cr692 NB. counts ,. values
2 2
6 3
7 4
[ where
frtab=: 3 : 0
y.=. y./:y.
difs=. 2-~/\(#y.),~I. 1,2~:/\y.
if. -.isNum y. do. difs=. <"0 difs [ y.=. <"1 ,.y. end.
difs,.~.y.
)
]
Converting these to probabilities for ratings of >:i.5:
]pr692=. ([:(%+/)0{1 0-~[:|:[:frtab 1 2 3 4 5,]) 2{cr692
0 0.13333333 0.4 0.46666667 0
[I concatenate 1 2 3 4 5 to ensure an entry for any missing rating, subtract
one from each count (the 1 0-~) to undo that, and apply (%+/) to make the
probabilities sum to one.]
Thus, customer 692 has given a rating of "3" 40% of the time and
a rating of "4" about 47% of the time.
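A shorter route to the same probabilities, as a sketch that does not go
through frtab: compare every possible rating with the data and count the
matches. The names r, cnt and pr are placeholders introduced here, and r is
simply typed in from the ratings shown above.

```j
r=: 2 4 4 3 3 3 4 2 3 4 4 3 4 3 4   NB. ratings of customer 692, copied from above
cnt=: +/"1 (>:i.5) =/ r             NB. occurrences of each rating 1..5: 0 2 6 7 0
pr=: (%+/) cnt                      NB. divide by the total so the entries sum to one
```

Ratings that never occur get a count of zero on their own, so nothing has to
be concatenated and subtracted again.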
This distribution has a mean and standard deviation:
(mean,stddev) 2{cr692
3.3333333 0.72374686
Alternate mean calculation from the probability vector:
pr692 +/ . * >:i.5
3.3333333
What I'd like to do is to construct an equivalent 5-element distribution
with the same mean and standard deviation but (more or less) normally
distributed.
I can easily do this for a standard normal (mean 0 and SD 1):
NB.* pdfnc: probability density fnc for normal curve w/given mean and SD.
pdfnc=: 4 : '(%sd*%:o. 2)*^-(*:x.-mn)%+:*:sd [ ''mn sd''=. y.'
Assuming the end-points are two standard deviations ((i:2j4) -: _2 _1 0 1 2)
from the mean:
]sn=. (%+/)(i:2j4) pdfnc 0 1
0.054488685 0.24420134 0.40261995 0.24420134 0.054488685
This distribution "sn", applied to the ratings >:i.5, has a mean of 3
sn +/ . * >:i.5
3
and an approximate standard deviation of
stddev (<.0.5+1e6*sn)#>:i.5
0.96141298
This is slightly less than one because the tails beyond two standard
deviations are cut off before (%+/) rescales the remaining five values to
sum to one. I'm sure there's a more exact,
analytic way to calculate the standard deviation but this works well
enough for now and I'm mostly concerned with the mean.
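One more exact way to get the standard deviation, sketched under the
assumption that what is wanted is the SD of the five-point distribution
itself: take the probability-weighted mean of the squared deviations from the
mean. The names v, pmean and psd are introduced here for illustration and are
not part of the original session; sn is rebuilt without pdfnc, since the
constant factor of the density cancels under (%+/).

```j
v=: 1 2 3 4 5                            NB. the possible ratings
pmean=: (+/ . *)&v                       NB. mean of a probability vector y over v
psd=: [: %: ] (+/ . *) [: *: v - pmean   NB. sqrt of the sum of y * (v - mean)^2
sn=: (%+/) ^ - -: *: i:2                 NB. ^-0.5*x^2 at _2 _1 0 1 2, normalized
pmean sn                                 NB. 3, by symmetry
psd sn                                   NB. about 0.9614
```

A probability vector carries no sample size, so this is the population form.
Applied to the probabilities for customer 692 it gives about 0.6992 rather
than 0.7237; the latter is what a stddev that divides by n-1 produces for
15 ratings, which is what the 0.72374686 figure indicates.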
I can see an iterative way to get where I want:
NB. First, adjust the mean:
(sn=. (%+/)(_0.3+i:2j4) pdfnc 0 1)+/ . * >:i.5
3.2758083
(sn=. (%+/)(_0.35+i:2j4) pdfnc 0 1)+/ . * >:i.5
3.3211532
(sn=. (%+/)(_0.37+i:2j4) pdfnc 0 1)+/ . * >:i.5
3.3392133
. . .
(sn=. (%+/)(_0.3635+i:2j4) pdfnc 0 1)+/ . * >:i.5
3.3333489
NB. Now work on the standard deviation:
stddev (<.0.5+1e7*sn=. (%+/)(_0.3635+1.1*i:2j4) pdfnc 0 1)#>:i.5
0.884115
stddev (<.0.5+1e7*sn=. (%+/)(_0.3635+1.2*i:2j4) pdfnc 0 1)#>:i.5
0.82181792
stddev (<.0.5+1e7*sn=. (%+/)(_0.3635+1.5*i:2j4) pdfnc 0 1)#>:i.5
0.66592555
stddev (<.0.5+1e7*sn=. (%+/)(_0.3635+1.4*i:2j4) pdfnc 0 1)#>:i.5
0.71245182
NB. Of course, this throws off the mean:
sn +/ . * >:i.5
3.2584499
adjmsd=: (([:mean 1 2 3 4 5+/ .*~]),[:stddev 1 2 3 4 5#~[:<.0.5+1e7*])
NB. Combine target measures...
. . .
adjmsd sn=. (%+/)(_0.461+1.376*i:2j4) pdfnc 0 1
3.3331422 0.7238559
NB. Not too bad compared to:
(mean,stddev) 2{cr692
3.3333333 0.72374686
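The hand search can be automated. The sketch below is not a closed-form
answer; it assumes a simple fixed-point correction is acceptable: evaluate
normal-shaped weights at the ratings 1..5 for a trial (mean, SD) pair, measure
the mean and SD actually obtained after normalization, and move the trial pair
by the shortfall. All names (v, norm, dnorm, pmean, psd, msd, step, t, fit)
are introduced here for illustration, and the target t is copied from the
figures above.

```j
v=: 1 2 3 4 5                             NB. the possible ratings
norm=: % +/                               NB. scale a vector so it sums to one
NB. dnorm y: weights ^-0.5*((v-m)%s)^2 for y = m,s, normalized over v
dnorm=: [: norm [: ^ [: - [: -: [: *: (v - {.) % {:
pmean=: (+/ . *)&v                        NB. mean of a probability vector
psd=: [: %: ] (+/ . *) [: *: v - pmean    NB. SD of a probability vector
msd=: pmean , psd                         NB. both measures as a 2-element list
NB. x step y: shift trial parameters y by the gap between target x and result
step=: ] + [ - [: msd [: dnorm ]
t=: 3.3333333 0.72374686                  NB. target mean and SD from the post
fit=: t step^:20 t                        NB. start at the target, correct 20 times
msd dnorm fit                             NB. should agree closely with t
```

dnorm fit is then the fitted five-element distribution. Starting from the
target itself works here because very little weight falls outside 1..5; for a
distribution pressed against one end of the scale the iteration may need more
steps, or may not converge at all.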
This is probably workable but there must be an analytic solution,
probably a fairly straightforward one.
Any ideas?
--
Devon McCormick
^me^ at acm.
org is my
preferred e-mail