I don’t know shucks about statistics, but I thought I would toy with some different random distributions to see how they behaved. Contrary to my expectations, the sample median was never the more consistent measure, and was often worse.
#!/usr/bin/python """Different measures of central tendency have different variability. And it depends on the distribution of the underlying data. Some distributions are easy to characterize from a random sample. If your data is normally or uniformly distributed, you can get a pretty good estimate of the mean, which is also the median, after just ten or twenty data points. But some other distributions are not so well-behaved. The exponential distribution is a common one. Its median is well to the left of its mean. Does the sample mean or the sample median have greater variance? I hypothesize, without actually doing the math, that the sample mean of an exponential distribution has proportionally greater variance, and therefore the sample median is a better measure to use, if you have to pick one. """ from __future__ import division import random, math, sys sample_mean = lambda sample: sum(sample)/len(sample) sample_means = lambda samples: map(sample_mean, samples) # wrong for even samples, but close enough: sample_median = lambda sample: sorted(sample)[len(sample)//2] sample_medians = lambda samples: map(sample_median, samples) uniform_sample = lambda n: [random.uniform(0, 1) for ii in range(n)] expo_sample = lambda n: [random.expovariate(1) for ii in range(n)] def standard_deviation(sample): mean = sample_mean(sample) return math.sqrt(sum((x - mean)**2 for x in sample)/(len(sample)-1)) uniform_samples = lambda n, m: [uniform_sample(m) for ii in range(n)] expo_samples = lambda n, m: [expo_sample(m) for ii in range(n)] def compare(n, m): print "%d samples of %d items each:" % (n, m) print "Uniform:", describe(uniform_samples(n, m)) print "Exponential:", describe(expo_samples(n, m)) def describe(samples): means, medians = sample_means(samples), sample_medians(samples) print "standard deviation of mean %.2f (mean mean %.2f), of median %.2f (mean median %.2f)" % (standard_deviation(means), sample_mean(means), standard_deviation(medians), sample_mean(medians)) if __name__ == '__main__': compare(int(sys.argv[1]), int(sys.argv[2])) (End of `variance.py`.) Example usage: : kra...@inexorable:~/devel/inexorable-misc ; ./variance.py 10000 20 10000 samples of 20 items each: Uniform: standard deviation of mean 0.06 (mean mean 0.50), of median 0.11 (mean median 0.52) Exponential: standard deviation of mean 0.22 (mean mean 1.00), of median 0.24 (mean median 0.77) This software is available via git clone http://canonical.org/~kragen/sw/inexorable-misc.git (or in <http://canonical.org/~kragen/sw/inexorable-misc>) in the file `variance.py`. Like everything else posted to kragen-hacks without a notice to the contrary, this software is in the public domain. -- To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-hacks