On Fri, Jun 25, 2010 at 19:08, Miquel Torres <tob...@googlemail.com> wrote:
> Hi Paolo,
>
> I am aware of the problem with calculating benchmark means, but let me
> explain my point of view.
>
> You are correct in that it would be preferable to have absolute times. Well,
> you actually can, but see what it happens:
> http://speed.pypy.org/comparison/?hor=true&bas=none&chart=stacked+bars
Ahah! I didn't notice that I could skip normalization! This does not fully
invalidate my point, however.

> Absolute values would only work if we had carefully chosen benchmaks
> runtimes to be very similar (for our cpython baseline). As it is, html5lib,
> spitfire and spitfire_cstringio completely dominate the cummulative time.

I acknowledge that (btw, it should be "cumulative time", with one 'm', both
here and on the website).

> And not because the interpreter is faster or slower but because the
> benchmark was arbitrarily designed to run that long. Any improvement in the
> long running benchmarks will carry much more weight than in the short
> running.
> What is more useful is to have comparable slices of time so that the
> improvements can be seen relatively over time.

If you want to sum up times (but at this point, I see no reason for it), you
should rather use externally derived weights, as suggested by the paper (in
Rule 3). As soon as you take the weights from the data themselves, much of
the maths you need stops working - that's generally true in statistics. And
the only sensible way to get external weights is to gather them from
real-world programs. Since that's not going to happen easily, just stick with
the geometric mean. Or set an arbitrarily low weight, manually, without any
maths, so that the long-running benchmarks stop dominating the result. That's
no fraud, since the current graph is less valid anyway.

> Normalizing does that i
> think.

Not really.

> It just says: we have 21 tasks which take 1 second to run each on
> interpreter X (cpython in the default case). Then we see how other
> executables compare to that. What would the geometric mean achieve here,
> exactly, for the end user?

You actually need the geomean to do that. Don't forget that the geomean is
still a mean: it is a mean performance ratio, averaging the individual
per-benchmark performance ratios. If PyPy's geomean is 0.5, it means that
PyPy is going to run those tasks in about 10.5 seconds instead of 21. To me,
this sounds exactly like what you want to achieve. Moreover, it actually
works, unlike what you use now. For instance, ignore PyPy-JIT and look only
at CPython and pypy-c (no JIT). Then change the normalization between the
two:

http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=2%2B35&chart=stacked+bars
http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=3%2BL&chart=stacked+bars

With the current data, you get that in one case CPython is faster and in the
other pypy-c is faster. That can't happen with the geomean - this is the
point of the paper (see also the small sketch below).

I could even construct a normalization baseline $base such that CPython seems
faster than PyPy-JIT. Such a base should be very fast on a benchmark where
PyPy-JIT is slower than CPython - say ai - so that $cpython.ai/$base.ai
becomes 100 while $pypyjit.ai/$base.ai becomes 200, and very slow on all the
other benchmarks (so that they disappear in the sum).

So, the only difference I see is that the geomean works and the arithmetic
mean doesn't. That's why Real Benchmarkers use the geomean.

Moreover, you are making a mistake quite common among non-physicists. What
you say makes sense only under the implicit assumption that dividing two
times gives something you can still use as a time. When you say "PyPy's
runtime for a 1-second task", you actually want to talk about a performance
ratio, not about a time. In the same way, when you say "this bird runs 3
meters in one second", a physicist would write that down as "3 m/s" rather
than "3 m".
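To make this concrete, here is a tiny self-contained Python sketch, with
invented runtimes for two hypothetical executables X and Y and a cooked-up
baseline Z (not the real speed.pypy.org numbers). It compares what the
stacked bars add up - an arithmetic sum of normalized times - against the
geometric mean of the same per-benchmark ratios:

# Invented runtimes (seconds) for three benchmarks; X, Y, Z are hypothetical.
times = {
    "X": {"bm1": 1.0,  "bm2": 10.0,  "bm3": 3.0},
    "Y": {"bm1": 5.0,  "bm2":  2.0,  "bm3": 2.0},
    "Z": {"bm1": 0.01, "bm2": 100.0, "bm3": 100.0},  # pathological baseline:
    # very fast on the one benchmark where Y loses, very slow everywhere else
}

def stacked_total(exe, base):
    """What a stacked bar adds up: the sum of per-benchmark ratios exe/base."""
    return sum(times[exe][b] / times[base][b] for b in times[exe])

def geomean(exe, base):
    """Geometric mean of the per-benchmark ratios exe/base."""
    ratios = [times[exe][b] / times[base][b] for b in times[exe]]
    product = 1.0
    for r in ratios:
        product *= r
    return product ** (1.0 / len(ratios))

for base in ("X", "Y", "Z"):
    print("normalized to %s:" % base)
    for exe in ("X", "Y"):
        print("  %s  stacked total = %6.2f   geomean = %.3f"
              % (exe, stacked_total(exe, base), geomean(exe, base)))

Running it, X "wins" the stacked totals when you normalize to X (3.00 vs
5.87) or to the cooked-up Z (100.13 vs 500.04), and Y "wins" when you
normalize to Y (3.00 vs 6.70) - whereas Y has the smaller geomean ratio under
every one of the three baselines.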
> I am not really calculating any mean. You can see that I carefully avoided
> to display any kind of total bar which would indeed incur in the problem you
> mention. That a stacked chart implicitly displays a total is something you
> can not avoid, and for that kind of chart I still think normalized results
> is visually the best option.

But on a stacked bars graph I'm not going to look at the individual bars at
all, just at the total: it's actually less convenient than "normal bars" for
reading off the result of a particular benchmark. I hope I can find
guidelines against stacked plots - I have a PhD colleague who is reading up
on how to make graphs.

Best regards
--
Paolo Giarrusso - Ph.D. Student
http://www.informatik.uni-marburg.de/~pgiarrusso/