On 6/3/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> - I would average the timings of runs instead of taking the minimum
> value, as sometimes benchmarks could be running code that is not
> deterministic in its calculations (could be using random numbers that
> affect convergence).

I would rewrite those to be deterministic. Any benchmark of mine which
uses random numbers initializes the generator with a fixed seed, and
does so in such a way that the order or choice of sub-benchmarks does
not affect the individual results. Any other way is madness.
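
For instance (just a sketch; the sorting workload and the particular
seed are placeholders, not anything from stringbench):

    import random, time

    def bench_sort():
        # A private generator with a fixed seed: the input data is
        # identical on every run and unaffected by whatever other
        # benchmarks did to the global random state, so sub-benchmark
        # order can't change the result.
        rng = random.Random(12345)
        data = [rng.random() for i in range(10000)]
        t0 = time.time()
        for rep in range(100):
            sorted(data)
        return time.time() - t0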
> - Before calculating the average number I would throw out samples
> outside 3 sigmas (the outliers).

As I've insisted, talking about sigmas assumes a Gaussian distribution.
It's more likely that the timing variations (at least in stringbench)
are closer to a gamma distribution.

> Here is a passage I found ...

...

> Unfortunately, defining an outlier is subjective (as it should be),
> and the decisions concerning how to identify them must be made on an
> individual basis (taking into account specific experimental paradigms

The experimental paradigm I've been using is:

  - a clock that is precise and accurate on timescales much smaller
    than the benchmark (hence effectively continuous distributions)
  - rare, random, short, and uncorrelated interruptions

This leads to a gamma distribution (plus a constant offset for the
minimum compute time).

Gamma distributions have longer tails than Gaussians and hence more
"outliers". If you think averaging is useful, then throwing those
outliers away will artificially lower the average value. To me, using
the minimum time, given this paradigm, makes sense. (There's a short
simulation at the end of this message illustrating both points.) How
fast is the fastest runner in the world? Do you have him run a dozen
times and take the average, or use the shortest time?

> I usually use this approach while reviewing data obtained by fairly
> accurate sensors, so being conservative and using 3 sigmas works
> well for those cases.

That's a different underlying physical process, and one which is
better modeled by Gaussians.

Consider this. For a given benchmark there is an absolute minimum time
for it to run on a given machine. Suppose this is 10 seconds, and the
benchmark timing comes out at 10.2 seconds. The +0.2 comes from
background overhead, though you don't know exactly how much is
overhead and how much is real compute time. If the times were
Gaussian, then a run of 10.5 seconds would be just as likely as one of
9.9 seconds. But repeat the benchmark as many times as you want and
you'll never see 9.9 seconds, though you will see 10.5.

> - Another improvement to benchmarks can be obtained when both the
> old and new code are available to be benchmarked together.

That's what stringbench does, comparing unicode and 8-bit strings.
However, how do you benchmark changes which are more complicated than
that? For example, changes to the exception mechanism, or builds under
gcc 3.x and 4.x?
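
Here's the simulation I mentioned (a quick sketch; the 10-second
floor, the gamma parameters, and the seed are invented to match the
example above):

    import random

    random.seed(12345)           # fixed seed, per the point above

    TRUE_MIN = 10.0              # absolute minimum compute time (seconds)

    # Each run: the true minimum plus gamma-distributed overhead from
    # rare, short, uncorrelated interruptions (mean overhead 0.2 s).
    times = [TRUE_MIN + random.gammavariate(2.0, 0.1)
             for i in range(1000)]

    mean = sum(times) / len(times)
    sigma = (sum((t - mean) ** 2 for t in times) / len(times)) ** 0.5

    # Discard "outliers" beyond 3 sigma and re-average.  Only the long
    # right-hand tail gets trimmed, so the average drops -- artificially.
    kept = [t for t in times if abs(t - mean) <= 3 * sigma]

    print("mean:         %.3f" % mean)                     # ~10.2
    print("trimmed mean: %.3f" % (sum(kept) / len(kept)))  # lower, still > 10
    print("minimum:      %.3f" % min(times))               # ~10.0
    # Plenty of runs come out above 10.4, and none at all below 10.0:
    # the distribution is one-sided, not Gaussian.

Andrew
[EMAIL PROTECTED]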