I don’t know shucks about statistics, but I thought I would toy with
some different random distributions to see how they behaved. Contrary
to my expectations, the sample median was never the more consistent
measure, and was often worse.

#!/usr/bin/python
"""Different measures of central tendency have different variability.

And it depends on the distribution of the underlying data.

Some distributions are easy to characterize from a random sample.  If
your data is normally or uniformly distributed, you can get a pretty
good estimate of the mean, which is also the median, after just ten or
twenty data points.

But some other distributions are not so well-behaved. The exponential
distribution is a common one. Its median is well to the left of its
mean.  Does the sample mean or the sample median have greater
variance? I hypothesize, without actually doing the math, that the
sample mean of an exponential distribution has proportionally greater
variance, and therefore the sample median is a better measure to use,
if you have to pick one.

"""

from __future__ import division
import random, math, sys

sample_mean = lambda sample: sum(sample)/len(sample)
sample_means = lambda samples: map(sample_mean, samples)

# Wrong for even-sized samples (it takes the upper of the two middle
# items, so it runs slightly high), but close enough:
sample_median = lambda sample: sorted(sample)[len(sample)//2]
sample_medians = lambda samples: map(sample_median, samples)

uniform_sample = lambda n: [random.uniform(0, 1) for ii in range(n)]
expo_sample = lambda n: [random.expovariate(1) for ii in range(n)]

def standard_deviation(sample):
    mean = sample_mean(sample)
    return math.sqrt(sum((x - mean)**2 for x in sample)/(len(sample)-1))

uniform_samples = lambda n, m: [uniform_sample(m) for ii in range(n)]
expo_samples = lambda n, m: [expo_sample(m) for ii in range(n)]

def compare(n, m):
    print "%d samples of %d items each:" % (n, m)
    print "Uniform:",
    describe(uniform_samples(n, m))
    print "Exponential:",
    describe(expo_samples(n, m))

def describe(samples):
    means, medians = sample_means(samples), sample_medians(samples)
    
    print ("standard deviation of mean %.2f (mean mean %.2f), "
           "of median %.2f (mean median %.2f)"
           % (standard_deviation(means), sample_mean(means),
              standard_deviation(medians), sample_mean(medians)))

if __name__ == '__main__':
    compare(int(sys.argv[1]), int(sys.argv[2]))

(End of `variance.py`.)
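
The script's `sample_median` admits to being wrong for even-sized samples: it returns the upper of the two middle items rather than their average. Here is a minimal sketch of the difference, written for Python 3 rather than the Python 2 of the script. The names `upper_median` and `exact_median` are made up for this illustration and are not in `variance.py`.

```python
import statistics

def upper_median(sample):
    """What variance.py does: the item at index len(sample)//2."""
    return sorted(sample)[len(sample) // 2]

def exact_median(sample):
    """Average the two middle items when the sample size is even."""
    ordered = sorted(sample)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

even = [1, 2, 3, 4]
print(upper_median(even))        # 3
print(exact_median(even))        # 2.5
print(statistics.median(even))   # 2.5, the standard library agrees
```

With 20 items per sample, the shortcut picks the 11th smallest item, whose expected value for Uniform(0, 1) data is 11/21, or about 0.52, rather than 0.50.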

Example usage:

    : kra...@inexorable:~/devel/inexorable-misc ; ./variance.py 10000 20
    10000 samples of 20 items each:
    Uniform: standard deviation of mean 0.06 (mean mean 0.50), of median 0.11 (mean median 0.52)
    Exponential: standard deviation of mean 0.22 (mean mean 1.00), of median 0.24 (mean median 0.77)
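
These numbers can be roughly checked against textbook formulas, which is the math the docstring skipped. The sketch below is in Python 3 and is not part of `variance.py`. It assumes the usual results: the standard deviation of a sample mean is sigma/sqrt(n), and the large-sample approximation for the sample median is 1/(2 f(m) sqrt(n)), where f(m) is the density at the true median. The second formula is only an approximation at n = 20, and the helper names are invented for this example.

```python
import math

n = 20  # items per sample, as in the example run above

def sd_of_mean(sigma, n):
    """Standard deviation of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def approx_sd_of_median(density_at_median, n):
    """Large-sample approximation: 1 / (2 * f(m) * sqrt(n))."""
    return 1 / (2 * density_at_median * math.sqrt(n))

uniform_sigma = math.sqrt(1 / 12)   # Uniform(0, 1)
uniform_density = 1.0               # density at the median, 0.5

expo_sigma = 1.0                         # Exponential with rate 1
expo_density = math.exp(-math.log(2))    # density at the median ln 2, i.e. 0.5

print("Uniform:     mean %.2f, median %.2f"
      % (sd_of_mean(uniform_sigma, n), approx_sd_of_median(uniform_density, n)))
print("Exponential: mean %.2f, median %.2f"
      % (sd_of_mean(expo_sigma, n), approx_sd_of_median(expo_density, n)))
# Uniform:     mean 0.06, median 0.11
# Exponential: mean 0.22, median 0.22
```

Under these assumptions the exponential mean and median come out about equally variable in absolute terms. Since the median being estimated (ln 2, about 0.69) is smaller than the mean (1), the median is proportionally the noisier of the two, which agrees with the simulation.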

This software is available via

    git clone http://canonical.org/~kragen/sw/inexorable-misc.git

(or in <http://canonical.org/~kragen/sw/inexorable-misc>) in the file
`variance.py`.

Like everything else posted to kragen-hacks without a notice to the
contrary, this software is in the public domain.
