Members of the Forum -

Let's say I'm looking at Netflix ratings as probability distributions
instead of as point estimates. Suppose customer 692 has these ratings:

  'rc cr692'=. getUserRecs 692
  2{cr692                               NB. The ratings
2 4 4 3 3 3 4 2 3 4 4 3 4 3 4

The frequency table looks like this:

  frtab 2{cr692 NB. counts ,. values
2 2
6 3
7 4

[ where
frtab=: 3 : 0
  y.=. y./:y.
  difs=. 2-~/\(#y.),~I. 1,2~:/\y.
  if. -.isNum y. do. difs=. <"0 difs [ y.=. <"1 ,.y. end.
  difs,.~.y.
)
]

Converting these to probabilities for ratings of >:i.5:

  ]pr692=. ([:(%+/)0{1 0-~[:|:[:frtab 1 2 3 4 5,]) 2{cr692
0 0.13333333 0.4 0.46666667 0

[I prepend 1 2 3 4 5 to ensure an entry for any missing rating, subtract
one from each count to undo that padding, and use (%+/) to make the
probabilities sum to one.]
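
A sketch of a shorter route to the same vector, assuming only the J
primitive key (/.) and the ratings typed in literally as r (a name made
up here so the lines run without getUserRecs). Because 1 2 3 4 5 comes
first, the counts come out in rating order with no sort needed:

  r=. 2 4 4 3 3 3 4 2 3 4 4 3 4 3 4   NB. same values as 2{cr692
  (%+/) <: #/.~ 1 2 3 4 5 , r         NB. count, drop the padding, normalize

This should reproduce pr692 as shown above.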

Thus, customer 692 has given a rating of "3" 40% of the time and
a rating of "4" about 47% of the time.

This distribution has a mean and standard deviation:
  (mean,stddev) 2{cr692
3.3333333 0.72374686

Alternate mean calculation from the probability vector:
  pr692 +/ . * >:i.5
3.3333333

What I'd like to do is to construct an equivalent 5-element distribution
with the same mean and standard deviation but (more or less) normally
distributed.

I can easily do this for a standard normal (mean 0 and SD 1):

NB.* pdfnc: probability density fnc for normal curve w/given mean and SD.
pdfnc=: 4 : '(%sd*%:o. 2)*^-(*:x.-mn)%+:*:sd [ ''mn sd''=. y.'

Assuming the end-points are two standard deviations from the mean
((i:2j4) -: _2 _1 0 1 2):

  ]sn=. (%+/)(i:2j4) pdfnc 0 1
0.054488685 0.24420134 0.40261995 0.24420134 0.054488685

This distribution "sn" has a mean of 3
  sn +/ . * >:i.5
3

and an approximate standard deviation of
  stddev (<.0.5+1e6*sn)#>:i.5
0.96141298

This is slightly less than one because the five points cut the curve off
at two standard deviations and (%+/) rescales what is left to sum to one.
I'm sure there's a more exact, analytic way to calculate the standard
deviation but this works well enough for now and I'm mostly concerned
with the mean.
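
One exact form, as a sketch: the standard deviation of a discrete
distribution is the square root of the probability-weighted squared
deviations from its mean, so no replication with # is needed. The name
vals is introduced here only for illustration:

  vals=. >:i.5
  %: sn +/ . * *: vals - sn +/ . * vals         NB. exact SD of sn
  %: pr692 +/ . * *: vals - pr692 +/ . * vals   NB. same for pr692

The first line should agree with the 0.96141298 above to about five
digits. The second comes to about 0.6992 rather than 0.72374686: this
form divides by n, and the stddev result is consistent with dividing by
n-1 (0.6992 * %:15%14 is about 0.7237). That is an inference from the
numbers, not something checked against the definition of stddev.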

I can see an iterative way to get where I want:

NB. First, adjust the mean:
  (sn=. (%+/)(_0.3+i:2j4) pdfnc 0 1)+/ . * >:i.5
3.2758083
  (sn=. (%+/)(_0.35+i:2j4) pdfnc 0 1)+/ . * >:i.5
3.3211532
  (sn=. (%+/)(_0.37+i:2j4) pdfnc 0 1)+/ . * >:i.5
3.3392133
. . .
  (sn=. (%+/)(_0.3635+i:2j4) pdfnc 0 1)+/ . * >:i.5
3.3333489

NB. Now work on the standard deviation:
  stddev (<.0.5+1e7*sn=. (%+/)(_0.3635+1.1*i:2j4) pdfnc 0 1)#>:i.5
0.884115
  stddev (<.0.5+1e7*sn=. (%+/)(_0.3635+1.2*i:2j4) pdfnc 0 1)#>:i.5
0.82181792
  stddev (<.0.5+1e7*sn=. (%+/)(_0.3635+1.5*i:2j4) pdfnc 0 1)#>:i.5
0.66592555
  stddev (<.0.5+1e7*sn=. (%+/)(_0.3635+1.4*i:2j4) pdfnc 0 1)#>:i.5
0.71245182

NB. Of course, this throws off the mean:
  sn +/ . * >:i.5
3.2584499
NB. Combine target measures...
  adjmsd=: (([:mean 1 2 3 4 5+/ .*~]),[:stddev 1 2 3 4 5#~[:<.0.5+1e7*])
. . .
  adjmsd sn=. (%+/)(_0.461+1.376*i:2j4) pdfnc 0 1
3.3331422 0.7238559
NB. Not too bad compared to:
  (mean,stddev) 2{cr692
3.3333333 0.72374686

This is probably workable but there must be an analytic solution,
probably a fairly straightforward one.
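
One observation that may help, offered as a sketch rather than a full
answer: the shift and scale found above are pdfnc's own mean and SD
arguments in disguise. With x = _0.461+1.376*(r-3) for r in >:i.5, x is
(r-m)%s where m is 3+0.461%1.376 (about 3.335) and s is %1.376 (about
0.727), and the extra factor %s that pdfnc applies cancels under (%+/).
The names m and s are introduced here for illustration:

  m=. 3+0.461%1.376
  s=. %1.376
  adjmsd (%+/)(>:i.5) pdfnc m,s   NB. should match 3.3331422 0.7238559

So the search could be run directly over m and s, starting from the
target mean and SD. The values that work differ slightly from the
targets, presumably because only five points of the curve are kept.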

Any ideas?

--
Devon McCormick
^me^ at acm.
org is my
preferred e-mail
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
