Thanks to both Scott and Matthew for your detailed discussion. I am not
concerned about the speed of the algorithm, as it's relatively simple math.
The conceptual points made here are worth more investigation, but given the
nature of the problem it will take some time to think this through and to run
the tweaked algorithm on our sample data, so I am marking this as Triaged. I
think it's a fascinating problem and deserves some more investigation, even if
the result might be that we just keep our current approach.
** Changed in: software-center (Ubuntu)
Importance: Undecided => Medium
** Changed in: software-center (Ubuntu)
Status: New => Triaged
--
You received this bug notification because you are a member of Desktop
Packages, which is subscribed to software-center in Ubuntu.
https://bugs.launchpad.net/bugs/894468
Title:
Statistics algorithm for sorting ratings looks fishy
Status in “software-center” package in Ubuntu:
Triaged
Bug description:
Here's the current code snippet for sorting the Software Center
Ratings:
  def wilson_score(pos, n, power=0.2):
      if n == 0:
          return 0
      z = pnormaldist(1 - power/2)
      phat = 1.0 * pos / n
      return (phat + z*z/(2*n) - z *
              math.sqrt((phat*(1-phat) + z*z/(4*n))/n)) / (1 + z*z/n)

  def calc_dr(ratings, power=0.1):
      '''Calculate the dampened rating for an app given its collective
      ratings'''
      if not len(ratings) == 5:
          raise AttributeError('ratings argument must be a list of 5 integers')
      tot_ratings = 0
      for i in range(0, 5):
          tot_ratings = ratings[i] + tot_ratings
      sum_scores = 0.0
      for i in range(0, 5):
          ws = wilson_score(ratings[i], tot_ratings, power)
          sum_scores = sum_scores + float((i+1) - 3) * ws
      return sum_scores + 3
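For anyone who wants to experiment with this, here is a self-contained version of the snippet. The pnormaldist helper is not shown in the source; I am assuming it is the standard normal quantile function (inverse CDF), which Python 3.8+'s statistics.NormalDist provides:

```python
import math
from statistics import NormalDist  # Python 3.8+


def pnormaldist(p):
    # Assumption: software-center's pnormaldist(p) is the standard
    # normal quantile function (inverse CDF).
    return NormalDist().inv_cdf(p)


def wilson_score(pos, n, power=0.2):
    # Lower bound of the Wilson score interval for pos successes out of n.
    if n == 0:
        return 0
    z = pnormaldist(1 - power / 2)
    phat = 1.0 * pos / n
    return (phat + z * z / (2 * n) - z *
            math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)


def calc_dr(ratings, power=0.1):
    # Dampened rating: weighted sum of per-star Wilson scores, centred on 3.
    if len(ratings) != 5:
        raise AttributeError('ratings argument must be a list of 5 integers')
    tot_ratings = sum(ratings)
    sum_scores = 0.0
    for i in range(5):
        ws = wilson_score(ratings[i], tot_ratings, power)
        sum_scores += ((i + 1) - 3) * ws
    return sum_scores + 3
```

For what it's worth, the result is symmetric about the neutral score of 3: calc_dr([5, 0, 0, 0, 0]) is roughly 1.70 and calc_dr([0, 0, 0, 0, 5]) roughly 4.30.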
This looks very fishy to me, as we are calculating five different Wilson
scores per app (one per star level) and summing them. This is slow, and
probably wrong.
I'm not 100% sure what the right method to use is; however, I did find
this question on MathOverflow:
http://mathoverflow.net/questions/20727/generalizing-the-wilson-score-confidence-interval-to-other-distributions
The current answer there suggests using a standard normal distribution
for large samples and a t-distribution for small ones (we do neither).
This website suggests a slightly different Wilson algorithm:
http://www.goproblems.com/test/wilson/wilson.php?v1=0&v2=0&v3=3&v4=2&v5=4
I will go further and assert that we are making a conceptual error in trying
to estimate a mean rating in the first place: ratings are fundamentally
ordinal data, and thus a mean doesn't make much sense, for the same reason
that "excellent" + "terrible" does not balance out to "mediocre". Taking
medians and percentiles, however, is a perfectly valid measurement.
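To illustrate (this is a toy example of my own, not software-center code): given a histogram of 1- to 5-star counts, the median can be read straight off the cumulative counts, and it resists exactly the "excellent + terrible" averaging problem:

```python
def median_rating(ratings):
    # Lower median star level (1-5) for a histogram [n1, ..., n5]
    # of 1-star to 5-star counts; None if there are no ratings yet.
    total = sum(ratings)
    if total == 0:
        return None
    target = (total - 1) // 2  # index of the lower median in sorted order
    cumulative = 0
    for star, count in enumerate(ratings, start=1):
        cumulative += count
        if cumulative > target:
            return star
```

For a polarized app with three 1-star and two 5-star ratings, the mean is 2.6 ("mediocre-ish"), while the median is 1, which is far more honest about what raters actually said.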
I will research this question a bit more, and will probably post a
question on the beta stats Stack Exchange site for advice.
Intuitively, though, I think we may want a ratings algorithm that sorts
primarily by median, and then, for the large number of cases where two
apps have the same median (since there are only 5 possible rating
values), breaks the tie by computing a Wilson score lower bound on the
probability that a rater of App A would rate >= the median rather than
below it.
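A rough sketch of that idea, to make the intuition concrete (entirely my own illustration; the 95% confidence level and the helper names are assumptions, not a worked-out design):

```python
import math
from statistics import NormalDist  # Python 3.8+


def wilson_lower_bound(pos, n, confidence=0.95):
    # Lower bound of the Wilson score interval for a binomial proportion.
    if n == 0:
        return 0.0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    phat = pos / n
    return (phat + z * z / (2 * n)
            - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)


def median_star(ratings):
    # Lower median star level (1-5) from a histogram of 1..5-star counts.
    total = sum(ratings)
    target = (total - 1) // 2
    cumulative = 0
    for star, count in enumerate(ratings, start=1):
        cumulative += count
        if cumulative > target:
            return star


def sort_key(ratings):
    # Primary key: the median rating; tie-breaker: the Wilson lower
    # bound of P(a rating is >= the median).
    n = sum(ratings)
    if n == 0:
        return (0, 0.0)
    med = median_star(ratings)
    at_or_above = sum(ratings[med - 1:])
    return (med, wilson_lower_bound(at_or_above, n))
```

Apps would then be ranked with something like sorted(histograms, key=sort_key, reverse=True); Python's tuple comparison applies the tie-breaker only when medians are equal.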
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/software-center/+bug/894468/+subscriptions
--
Mailing list: https://launchpad.net/~desktop-packages
Post to : [email protected]
Unsubscribe : https://launchpad.net/~desktop-packages
More help : https://help.launchpad.net/ListHelp