Hi Paolo, hey! I think it is a great idea. With logs you get both: correct normalized totals AND the ability to display the individual stacked series, which necessarily add arithmetically. But it strikes me: hasn't anyone written a paper about that method already, or at least documented it somewhere?
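To make sure I follow, here is a quick Python sketch of the math as you describe it (invented numbers and my own helper names, so please treat it as an assumption until the math is properly rechecked):

```python
import math

# Invented per-benchmark times in seconds; not real speed.pypy.org data.
times = {
    "cpython":  {"ai": 1.0, "html5lib": 12.0, "spitfire": 8.0},
    "pypy-c":   {"ai": 2.0, "html5lib":  3.0, "spitfire": 2.0},
    "pypy-jit": {"ai": 1.5, "html5lib":  1.0, "spitfire": 0.5},
}
BENCHES = ["ai", "html5lib", "spitfire"]
A, B = 1.0, 1.0  # a: offset so a ratio of 1 still draws a visible bar; b: scale

def bar_heights(exe, baseline):
    """One stacked segment per benchmark: a + b*log(time / baseline time)."""
    return [A + B * math.log(times[exe][bench] / times[baseline][bench])
            for bench in BENCHES]

def geomean(exe, baseline):
    """Geometric mean of the per-benchmark performance ratios."""
    ratios = [times[exe][bench] / times[baseline][bench] for bench in BENCHES]
    return math.prod(ratios) ** (1.0 / len(ratios))

N = len(BENCHES)
for exe in times:
    # Total bar height equals N*a + b*N*log(geomean): monotone in the geomean.
    total = sum(bar_heights(exe, "cpython"))
    assert math.isclose(total, N * A + B * N * math.log(geomean(exe, "cpython")))

# Re-normalizing shifts *every* total by the same constant, so the
# ordering of the stacked totals does not depend on the baseline.
shifts = [sum(bar_heights(exe, "pypy-c")) - sum(bar_heights(exe, "cpython"))
          for exe in times]
assert all(math.isclose(s, shifts[0]) for s in shifts)
print("log-stacked totals are consistent with the geometric mean")
```

If that holds, switching the normalization baseline only shifts every total bar by the same constant, so the ordering of the stacked totals never changes.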
Anyway, I still need to check that the math is right (hopefully today), and then I'll go ahead and implement it. I'll tell you how it goes. Cheers, Miquel

2010/6/30 Paolo Giarrusso <[email protected]>:
> Hi Miquel,
> I'm quite busy (because of a paper deadline next Tuesday), sorry for not answering earlier.
>
> I was just struck by an idea: there is a stacked bar plot whose total bar is related to the geometric mean, so that it is normalization-invariant. But this graph _is_ complicated.
>
> What if it is a stacked plot of _logarithms_ of performance ratios? This way, the complete stacked bar shows the logarithm of the product of the ratios rather than their sum, i.e. the log of (geometric mean)^N rather than their arithmetic mean; and log((geometric mean)^N) = N * log(geometric mean).
>
> Some simple maths (I didn't write it out, so please recheck!) seems to show that displaying (a + b*log(ratio)) instead of log(ratio) still gives a fair comparison, obtaining a total of N*a + b*N*log(geomean), which is an affine (hence order-preserving) function of log(geomean). You need a and b because if the ratio is 1, log(1) is zero and the bar would vanish (a is an offset that keeps such bars visible; b is the representation scale, which is always there).
>
> About your workaround: I would like a table with the geometric mean of the ratios, where we get the real global performance ratio among the interpreters. As long as the results of your solution do not contradict that _real_ table, it should be a reasonable workaround (but I would embed the check in the code - otherwise other projects _will be_ bitten by that). Probably I would also like the website to offer such a table to users, and I would like a graph of the overall performance ratio over time (actually, over revisions).
>
> Finally, the docs of your web application should at the very least reference the paper and this conversation (if there's a public archive of the ML, as I think there is), and ideally explain the issue.
>
> Sorry for being too dense, maybe - if I was unclear, please tell me and I'll answer next week.
>
> Best regards,
> Paolo
>
> On Mon, Jun 28, 2010 at 11:21, Miquel Torres <[email protected]> wrote:
>> Hi Paolo,
>>
>> I read the paper, very interesting. It is perfectly clear that to calculate a normalized total, only the geometric mean makes sense.
>>
>> However, a stacked bar plot shows the individual benchmarks, so it is implicitly an arithmetic mean. The only solution (apart from removing the stacked charts and only offering total bars) is the weighted approach.
>>
>> External weights are not very practical, though. Codespeed is used by other projects, so an extra option would need to be added to the settings to allow introducing arbitrary weights for benchmarks. A bit cumbersome. I have an idea that may work: take the weights from a defined baseline so that the run times are equal, which is the same as normalizing to a baseline. It would be the same as now, except that you can't choose the normalization; results will be weighted (normalized) according to the default baseline (which you can already configure in the settings).
>>
>> You may say that it is still an arithmetic mean, but there won't be conflicting results because there is only a single normalization. For PyPy that would be CPython, and everything would make sense.
>> I know it is a workaround, not a solution. If you think it is a bad idea, the only other possibility is not to have stacked bars (as in "showing individual benchmarks"). But I find them useful.
>> Yes, you can see the individual benchmark results better in the normal bars chart, but there you don't see visually which benchmarks take the biggest part of the pie, which helps to visualize what parts of your program most need improving.
>>
>> What do you think?
>>
>> Regards,
>> Miquel
>>
>> 2010/6/25 Paolo Giarrusso <[email protected]>:
>>> On Fri, Jun 25, 2010 at 19:08, Miquel Torres <[email protected]> wrote:
>>>> Hi Paolo,
>>>>
>>>> I am aware of the problem with calculating benchmark means, but let me explain my point of view.
>>>>
>>>> You are correct in that it would be preferable to have absolute times. Well, you actually can, but see what happens:
>>>> http://speed.pypy.org/comparison/?hor=true&bas=none&chart=stacked+bars
>>>
>>> Ahah! I didn't notice that I could skip normalization! This does not fully invalidate my point, however.
>>>
>>>> Absolute values would only work if we had carefully chosen benchmark runtimes to be very similar (for our cpython baseline). As it is, html5lib, spitfire and spitfire_cstringio completely dominate the cummulative time.
>>>
>>> I acknowledge that (btw, it should be cumulative time, with one 'm', both here and in the website).
>>>
>>>> And not because the interpreter is faster or slower, but because the benchmark was arbitrarily designed to run that long. Any improvement in the long-running benchmarks will carry much more weight than one in the short-running ones.
>>>>
>>>> What is more useful is to have comparable slices of time so that the improvements can be seen relatively over time.
>>>
>>> If you want to sum up times (though at this point, I see no reason for it), you should rather have externally derived weights, as suggested by the paper (in Rule 3). As soon as you take the weights from the data itself, a lot of the maths you need stops working - that is generally true in many situations in statistics. And the only way to get meaningful external weights is to gather them from real-world programs. Since that's not going to happen easily, just stick with the geometric mean. Or set an arbitrarily low weight, manually, without any math, so that the long-running benchmarks stop dominating the results. It's no fraud, since the current graph is less valid anyway.
>>>
>>>> Normalizing does that, I think.
>>>
>>> Not really.
>>>
>>>> It just says: we have 21 tasks which each take 1 second to run on interpreter X (cpython in the default case). Then we see how other executables compare to that. What would the geometric mean achieve here, exactly, for the end user?
>>>
>>> You actually need the geomean to do that. Don't forget that the geomean is still a mean: it is a mean performance ratio which averages the individual performance ratios. If PyPy's geomean is 0.5, it means that PyPy is going to run that task in 10.5 seconds instead of 21. To me, this sounds exactly like what you want to achieve. Moreover, it actually works, unlike what you use.
>>>
>>> For instance, ignore PyPy-JIT, and look only at CPython and pypy-c (no JIT).
>>> Then, change the normalization between the two:
>>> http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=2%2B35&chart=stacked+bars
>>> http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=3%2BL&chart=stacked+bars
>>> With the current data, you get that in one case cpython is faster, and in the other pypy-c is faster. That can't happen with the geomean. This is the point of the paper.
>>>
>>> I could even construct a normalization baseline $base such that CPython seems faster than PyPy-JIT. Such a base should be very fast on, say, ai (where CPython is slower), so that $cpython.ai/$base.ai becomes 100 while $pypyjit.ai/$base.ai becomes 200, and very slow on the other benchmarks (so that they disappear in the sum).
>>>
>>> So, the only difference I see is that the geomean works and the arithmetic mean doesn't. That's why Real Benchmarkers use the geomean.
>>>
>>> Moreover, you are making a mistake quite common among non-physicists. What you say makes sense under the implicit assumption that dividing two times gives something you can use as a time. When you say "PyPy's runtime for a 1-second task", you actually want to talk about a performance ratio, not about a time. In the same way, when you say "this bird runs 3 meters in one second", a physicist would sum that up as "3 m/s" rather than "3 m".
>>>
>>>> I am not really calculating any mean. You can see that I carefully avoided displaying any kind of total bar, which would indeed run into the problem you mention. That a stacked chart implicitly displays a total is something you cannot avoid, and for that kind of chart I still think normalized results are visually the best option.
>>>
>>> But on a stacked bars graph, I'm not going to look at the individual bars at all, just at the total: it is actually less convenient than "normal bars" for looking at the result of a particular benchmark.
>>>
>>> I hope I can find guidelines against stacked plots; I have a PhD colleague reading up on how to make graphs.
>>>
>>> Best regards
>>> --
>>> Paolo Giarrusso - Ph.D. Student
>>> http://www.informatik.uni-marburg.de/~pgiarrusso/
>
> --
> Paolo Giarrusso - Ph.D. Student
> http://www.informatik.uni-marburg.de/~pgiarrusso/
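Below is a minimal, self-contained sketch (in Python, with invented numbers rather than the real speed.pypy.org data behind the two comparison links above) of the flip Paolo describes, and of why the geometric mean is immune to it:

```python
import math

# Invented times (seconds) for two executables on two benchmarks,
# chosen so the flip is obvious; not real speed.pypy.org data.
times = {"A": {"bench1": 1.0, "bench2": 100.0},
         "B": {"bench1": 2.0, "bench2":  50.0}}

def stacked_total(exe, baseline):
    """Total height of a normalized stacked bar: the arithmetic sum of ratios."""
    return sum(times[exe][b] / times[baseline][b] for b in times[baseline])

def geomean_ratio(exe, baseline):
    """Geometric mean of the per-benchmark performance ratios."""
    ratios = [times[exe][b] / times[baseline][b] for b in times[baseline]]
    return math.prod(ratios) ** (1.0 / len(ratios))

# The arithmetic totals flip with the choice of baseline:
print(stacked_total("A", "A"), stacked_total("B", "A"))  # 2.0 2.5 -> "A is faster"
print(stacked_total("A", "B"), stacked_total("B", "B"))  # 2.5 2.0 -> "B is faster"

# The geomean gives the same verdict under either normalization:
print(geomean_ratio("B", "A"), geomean_ratio("A", "B"))  # 1.0 1.0 -> a tie
```

Each executable "wins" under its own normalization according to the arithmetic totals, while the geometric mean calls it a tie either way; a comparison of this kind could serve as the check embedded in the Codespeed code that Paolo suggests above.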
