Here are my suggestions:

- While running benchmarks, don't listen to music, watch videos, use the 
keyboard/mouse, or run anything other than the benchmark code.  Seems like 
common sense to me.

- I would average the timings of runs instead of taking the minimum value, as 
benchmarks could sometimes be running code that is not deterministic in its 
calculations (it could be using random numbers that affect convergence).

- Before calculating the average I would throw out samples that fall outside 
3 sigmas (the outliers).  This would eliminate the samples that are out of 
whack due to events that are out of our control.  To use this approach it 
would be necessary to run some minimum number of times.  I believe 30-40 
samples would be necessary, but I'm no expert in statistics.  I base this on 
my recollection of a study on this I did some time in the late 90s.  I used 
to have a better feel for the number of samples required for a given number 
of sigmas used to determine the outliers, but I have to confess that I 
normally just use a minimum of 100 samples to play it safe.  I'm sure that 
with a little experimentation with benchmarks the proper number of samples 
could be determined.

Here is a passage I found at 
http://www.statsoft.com/textbook/stbasic.html#Correlationsf that is related.

'''Quantitative Approach to Outliers. Some researchers use quantitative methods 
to exclude outliers. For example, they exclude observations that are outside 
the range of ±2 standard deviations (or even ±1.5 sd's) around the group or 
design cell mean. In some areas of research, such "cleaning" of the data is 
absolutely necessary. For example, in cognitive psychology research on reaction 
times, even if almost all scores in an experiment are in the range of 300-700 
milliseconds, just a few "distracted reactions" of 10-15 seconds will 
completely change the overall picture. Unfortunately, defining an outlier is 
subjective (as it should be), and the decisions concerning how to identify them 
must be made on an individual basis (taking into account specific experimental 
paradigms and/or "accepted practice" and general research experience in the 
respective area). It should also be noted that in some rare cases, the relative 
frequency of outliers across a number of groups or cells of a design can be 
subjected to analysis and provide interpretable results. For 
example, outliers could be indicative of the occurrence of a phenomenon that is 
qualitatively different than the typical pattern observed or expected in the 
sample, thus the relative frequency of outliers could provide evidence of a 
relative frequency of departure from the process or phenomenon that is typical 
for the majority of cases in a group.'''

Now I personally feel that using a 1.5 or 2 sigma approach is rather loose 
for benchmarks, and the 3 I suggested might be too tight.  From 
experimentation we might find that 2.5 is more appropriate.  I usually use 
this approach while reviewing data obtained by fairly accurate sensors, so 
being conservative and using 3 sigmas works well for those cases.

The last statement in the passage is worth noting, as a high ratio of 
outliers could be used as an indication that the benchmark results for a 
particular run are invalid.  This could be used to throw out bad results 
caused by someone starting to listen to music while the benchmarks are 
running, antivirus software starting to run, etc.
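
To make the outlier idea concrete, here is a rough sketch of how the 
filtering and the outlier ratio could be computed.  The function name 
trimmed_mean, the n_sigmas default of 3 and the min_samples default of 30 are 
only assumptions for illustration; the right values would have to come out of 
experimentation.

```python
import statistics

def trimmed_mean(samples, n_sigmas=3.0, min_samples=30):
    """Average the samples after dropping those beyond n_sigmas.

    Returns (average of kept samples, fraction of samples dropped).
    The defaults are illustrative guesses, not tuned values.
    """
    if len(samples) < min_samples:
        raise ValueError("need at least %d samples" % min_samples)
    mean = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    # Keep only the samples within n_sigmas standard deviations of the mean.
    kept = [s for s in samples if abs(s - mean) <= n_sigmas * sigma]
    outlier_ratio = (len(samples) - len(kept)) / len(samples)
    return statistics.mean(kept), outlier_ratio
```

A run whose outlier_ratio comes back unusually high could then be flagged as 
suspect and repeated, rather than reported.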

- Another improvement to benchmarks can be obtained when both the old and new 
code are available to be benchmarked together.  By running the benchmarks of 
both codes together we could eliminate the effects of noise, if we assume 
that noise at a given point in time would be applied to both sets of code.  
Here is a modified version of the code that Andrew wrote previously, to show 
this more clearly than my words.

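import time

def compute_old():
    # Old code under test.
    x = 0
    for i in range(1000):
        for j in range(1000):
            x = x + 1

def compute_new():
    # New code under test.
    x = 0
    for i in range(1000):
        for j in range(1000):
            x += 1

def bench():
    # Time old and new back to back so both see similar system noise.
    # perf_counter() is used because time.clock() was removed in Python 3.8.
    t1 = time.perf_counter()
    compute_old()
    t2 = time.perf_counter()
    compute_new()
    t3 = time.perf_counter()
    return t2-t1, t3-t2

times_old = []
times_new = []
for i in range(1000):
    time_old, time_new = bench()
    times_old.append(time_old)
    times_new.append(time_new)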
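
One possible way to use the paired lists is to look at the per-run 
differences rather than at the two averages separately, since noise common to 
both halves of a run should largely cancel in the difference.  This is only a 
sketch under that assumption; paired_summary is a made-up helper name, and it 
needs at least two runs.

```python
import statistics

def paired_summary(times_old, times_new):
    """Return (mean of old-new differences, standard error of that mean)."""
    if len(times_old) != len(times_new):
        raise ValueError("both lists must hold one timing per run")
    # Positive differences mean the new code was faster in that run.
    diffs = [old - new for old, new in zip(times_old, times_new)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / len(diffs) ** 0.5
    return mean_diff, std_err
```

Calling paired_summary(times_old, times_new) after the loop above gives a 
mean difference that can be compared against its standard error to judge 
whether the change is larger than the remaining noise.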

John