Hi Paolo,

well, you are right of course. I had forgotten about the real problem, which you demonstrate quite well with your CPython and pypy-c case: depending on the normalization, you can make any stacked series look faster than the others.
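To convince myself, I reproduced the effect with a little script. The timings below are invented - only the pattern matters - but they show how each executable "wins" under its own baseline in the stacked totals, while the geometric mean ranks them the same way under both:

    import math

    # Invented timings in seconds; only the relative pattern matters.
    times = {
        "cpython": {"ai": 3.0, "html5lib": 4.0},
        "pypy-c":  {"ai": 1.0, "html5lib": 10.0},
    }
    benchmarks = ["ai", "html5lib"]

    def stacked_total(exe, base):
        # Height of the normalized stacked bar: the sum of the
        # per-benchmark ratios against the baseline executable.
        return sum(times[exe][b] / times[base][b] for b in benchmarks)

    def geomean(exe, base):
        # Geometric mean of the same per-benchmark ratios.
        ratios = [times[exe][b] / times[base][b] for b in benchmarks]
        return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

    for base in ("cpython", "pypy-c"):
        print("normalized to %s:" % base)
        for exe in ("cpython", "pypy-c"):
            print("  %-8s stacked total = %.2f  geomean = %.2f"
                  % (exe, stacked_total(exe, base), geomean(exe, base)))

Normalized to cpython, the stacked totals are 2.00 vs. 2.83, so CPython looks faster; normalized to pypy-c, they are 3.40 vs. 2.00, so pypy-c looks faster. The geomean picks pypy-c under either baseline (0.91 vs. 1.00, and 1.00 vs. 1.10), which I suppose is exactly your point.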
I will have a look at the literature and modify the normalized stacked plots accordingly. Thanks for taking the time to explain things in such detail.

Regards,
Miquel

2010/6/25 Paolo Giarrusso <[email protected]>:

> On Fri, Jun 25, 2010 at 19:08, Miquel Torres <[email protected]> wrote:
> > Hi Paolo,
> >
> > I am aware of the problem with calculating benchmark means, but let me
> > explain my point of view.
> >
> > You are correct in that it would be preferable to have absolute times.
> > Well, you actually can, but see what happens:
> > http://speed.pypy.org/comparison/?hor=true&bas=none&chart=stacked+bars
>
> Aha! I didn't notice that I could skip normalization! This does not
> fully invalidate my point, however.
>
> > Absolute values would only work if we had carefully chosen benchmark
> > runtimes to be very similar (for our CPython baseline). As it is,
> > html5lib, spitfire and spitfire_cstringio completely dominate the
> > cummulative time.
>
> I acknowledge that (btw, it should be cumulative time, with one 'm',
> both here and on the website).
>
> > And not because the interpreter is faster or slower, but because the
> > benchmark was arbitrarily designed to run that long. Any improvement in
> > the long-running benchmarks will carry much more weight than one in the
> > short-running ones.
> >
> > What is more useful is to have comparable slices of time so that the
> > improvements can be seen relatively over time.
>
> If you want to sum up times (but at this point, I see no reason for
> it), you should rather have externally derived weights, as suggested
> by the paper (in Rule 3).
> As soon as you take the weights from the data, a lot of the maths you
> need stops working - that's generally true in many areas of statistics.
> And the only sensible way to get external weights is to gather them
> from real-world programs. Since that's not going to happen easily,
> just stick with the geometric mean. Or set an arbitrarily low weight,
> manually, without any maths, so that the long-running benchmarks stop
> dominating the result. It's no fraud, since the current graph is less
> valid anyway.
>
> > Normalizing does that, I think.
>
> Not really.
>
> > It just says: we have 21 tasks which take 1 second each to run on
> > interpreter X (CPython in the default case). Then we see how other
> > executables compare to that. What would the geometric mean achieve
> > here, exactly, for the end user?
>
> You actually need the geomean to do that. Don't forget that the
> geomean is still a mean: it is a mean performance ratio which averages
> the individual performance ratios.
> If PyPy's geomean is 0.5, it means that PyPy is going to run that task
> in 10.5 seconds instead of 21. To me, this sounds exactly like what
> you want to achieve. Moreover, it actually works, unlike what you use.
>
> For instance, ignore PyPy-JIT and look only at CPython and pypy-c (no
> JIT). Then change the normalization between the two:
> http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=2%2B35&chart=stacked+bars
> http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=3%2BL&chart=stacked+bars
> With the current data, you get that in one case CPython is faster, and
> in the other pypy-c is faster.
> That can't happen with the geomean. This is the point of the paper.
>
> I could even construct a normalization baseline $base such that
> CPython seems faster than PyPy-JIT.
> Such a base should be very fast on, say, ai (where PyPy-JIT is
> slower), so that $cpython.ai/$base.ai becomes 100 and
> $pypyjit.ai/$base.ai becomes 200, and very slow on the other
> benchmarks (so that they disappear in the sum).
>
> So, the only difference I see is that the geomean works and the
> arithmetic mean doesn't. That's why Real Benchmarkers use the geomean.
>
> Moreover, you are making a mistake quite common among non-physicists.
> What you say makes sense under the implicit assumption that dividing
> two times gives something you can use as a time. When you say "PyPy's
> runtime for a 1 second task", you actually want to talk about a
> performance ratio, not about a time. In the same way, when you say
> "this bird runs 3 meters in one second", a physicist would sum that
> up as "3 m/s" rather than "3 m".
>
> > I am not really calculating any mean. You can see that I carefully
> > avoided displaying any kind of total bar, which would indeed incur the
> > problem you mention. That a stacked chart implicitly displays a total
> > is something you cannot avoid, and for that kind of chart I still
> > think normalized results are visually the best option.
>
> But on a stacked-bars graph, I'm not going to look at the individual
> bars at all, just at the total: it's actually less convenient than
> "normal bars" for looking at the result of a particular benchmark.
>
> I hope I can find guidelines against stacked plots; I have a PhD
> colleague reading up on how to make graphs.
>
> Best regards
> --
> Paolo Giarrusso - Ph.D. Student
> http://www.informatik.uni-marburg.de/~pgiarrusso/
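P.S.: Your $base construction was easy to check numerically as well. Again with made-up timings (the benchmark names are only for flavour), a baseline that is very fast on ai and very slow everywhere else makes CPython's normalized total half of PyPy-JIT's, even though PyPy-JIT is ten times faster on everything else:

    import math

    # Made-up timings in seconds: PyPy-JIT wins everywhere except on ai.
    cpython  = {"ai": 1.0, "django": 10.0, "spitfire": 10.0}
    pypy_jit = {"ai": 2.0, "django":  1.0, "spitfire":  1.0}

    # Adversarial baseline: very fast on ai, very slow on the rest.
    base = {"ai": 0.01, "django": 1000.0, "spitfire": 1000.0}

    def stacked_total(times):
        # Height of the normalized stacked bar against $base.
        return sum(times[b] / base[b] for b in base)

    def geomean(times):
        # Geometric mean of the per-benchmark ratios against $base.
        ratios = [times[b] / base[b] for b in base]
        return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

    print(stacked_total(cpython), stacked_total(pypy_jit))  # ~100.02 vs ~200.00
    print(geomean(cpython), geomean(pypy_jit))              # ~0.215 vs ~0.058

The stacked totals come out at roughly 100 vs. 200, as you predicted, while the geomean still ranks PyPy-JIT well ahead (about 0.06 vs. 0.22) against the very same baseline.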
