On Fri, Jul 2, 2010 at 09:27, Miquel Torres <[email protected]> wrote:
> Hi Paolo,
>
> hey! I think it is a great idea. With logs you get both: correct
> normalized totals AND the ability to display the individual stacked
> series, which necessarily add arithmetically. But it strikes me,
> hasn't anyone written a paper about that method already? Or at least
> documented it?
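A minimal sketch of the method under discussion, with invented
benchmark names and times: each stacked segment is log(ratio), so the
whole bar totals N*log(geometric mean of the ratios), and the gap
between two executables' bars does not depend on the baseline.

    import math

    # Invented per-benchmark times in seconds; cpython is the baseline.
    cpython = {"ai": 1.0, "html5lib": 12.0, "spitfire": 8.0}
    pypy = {"ai": 2.0, "html5lib": 6.0, "spitfire": 2.0}

    # One stacked segment per benchmark: the log of the performance ratio.
    # Segments for ratios below 1 are negative, which is where the a and b
    # rescaling discussed below comes in.
    segments = {b: math.log(pypy[b] / cpython[b]) for b in cpython}

    # The full bar is the sum of the logs = log of the product of ratios
    # = N * log(geometric mean); the difference between two executables'
    # bars is then independent of the chosen normalization baseline.
    total = sum(segments.values())
    geomean = math.exp(total / len(segments))
    print(segments)
    print(total, len(segments) * math.log(geomean))  # equal by construction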
I guess the problem is that the graph is weird enough, and that you
need arbitrary a and b to make it work, since the logarithm can get
negative and arbitrarily big: log 0 = -inf. I still think that's fair
and makes sense, but it's somewhat hard to sell.

> Anyway, I need to check that the math is right (hopefully today), and
> then I would go and implement it.
> I'll tell you how it goes.
>
> Cheers,
> Miquel
>
> 2010/6/30 Paolo Giarrusso <[email protected]>:
>> Hi Miquel,
>> I'm quite busy (because of a paper deadline next Tuesday), sorry for
>> not answering earlier.
>>
>> I was just struck by an idea: there is a stacked bar plot where the
>> total bar is related to the geometric mean, such that it is
>> normalization-invariant. But this graph _is_ complicated.
>>
>> It is a stacked plot of _logarithms_ of performance ratios. This way,
>> the complete stacked bar shows the logarithm of the product of the
>> ratios, rather than their sum, i.e. the log of (geometric mean)^N
>> rather than their arithmetic mean; and log of (geometric mean)^N =
>> N*log(geometric mean).
>>
>> Some simple maths (I didn't write it out, so please recheck!) seems to
>> show that showing a + b*log(ratio), instead of log(ratio), still gives
>> a fair comparison, the total being N*a + b*N*log(geomean) =
>> \Theta(log(geomean)). You need a and b because if the ratio is 1,
>> log(1) is zero, so the segment would vanish (b is the representation
>> scale, which is always there).
>>
>> About your workaround: I would like a table with the geometric mean of
>> the ratios, where we get the real global performance ratio among the
>> interpreters. As long as the results of your solution do not
>> contradict that _real_ table, it should be a reasonable workaround
>> (but I would embed the check in the code - otherwise other projects
>> _will be_ bitten by that). Probably, I would like the website to offer
>> such a table to users, and I would like a graph of the overall
>> performance ratio over time (actually, over revisions).
>>
>> Finally, the docs of your web application should at the very least
>> reference the paper and this conversation (if there's a public archive
>> of the ML, as I believe), and ideally explain the issue.
>>
>> Sorry for being too dense, maybe - if I was unclear, please tell me
>> and I'll answer next week.
>>
>> Best regards,
>> Paolo
>>
>> On Mon, Jun 28, 2010 at 11:21, Miquel Torres <[email protected]> wrote:
>>> Hi Paolo,
>>>
>>> I read the paper, very interesting. It is perfectly clear that to
>>> calculate a normalized total, only the geometric mean makes sense.
>>>
>>> However, a stacked bars plot shows the individual benchmarks, so it
>>> implicitly is an arithmetic mean. The only solution (apart from
>>> removing the stacked charts and only offering total bars) is the
>>> weighted approach.
>>>
>>> External weights are not very practical, though. Codespeed is used by
>>> other projects, so an extra option would need to be added to the
>>> settings to allow introducing arbitrary weights for benchmarks. A bit
>>> cumbersome. I have an idea that may work: take the weights from a
>>> defined baseline so that the run times are equal, which is the same
>>> as normalizing to a baseline. It would be the same as now, only that
>>> you can't choose the normalization; it will be weighted (normalized)
>>> according to the default baseline (which you can already configure in
>>> the settings).
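A minimal sketch of this workaround, again with invented times: fixing
the per-benchmark weights to 1/t_baseline is exactly normalization to
the default baseline, so every baseline slice becomes 1.0 and the
slices are comparable.

    # Invented times; "cpython" plays the role of the default baseline.
    times = {
        "cpython": {"ai": 1.0, "html5lib": 12.0, "spitfire": 8.0},
        "pypy-c": {"ai": 0.8, "html5lib": 9.0, "spitfire": 6.0},
    }
    baseline = times["cpython"]

    # Weighting raw times by 1/t_baseline is the same as dividing by the
    # baseline time: weight[b] * time == time / baseline[b].
    weights = {b: 1.0 / t for b, t in baseline.items()}

    def stacked_segments(exe):
        return {b: weights[b] * times[exe][b] for b in times[exe]}

    print(stacked_segments("cpython"))  # every slice is 1.0 by construction
    print(stacked_segments("pypy-c"))   # comparable slices of time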
>>>
>>> You may say that it is still an arithmetic mean, but there won't be
>>> conflicting results because there is only a single normalization. For
>>> PyPy that would be cpython, and everything would make sense.
>>> I know it is a workaround, not a solution. If you think it is a bad
>>> idea, the only other possibility is not to have stacked bars (as in
>>> "showing individual benchmarks"). But I find them useful. Yes, you
>>> can see the individual benchmark results better in the normal bars
>>> chart, but there you don't see visually which benchmarks take the
>>> biggest part of the pie, which helps visualize what parts of your
>>> program most need improving.
>>>
>>> What do you think?
>>>
>>> Regards,
>>> Miquel
>>>
>>> 2010/6/25 Paolo Giarrusso <[email protected]>:
>>>> On Fri, Jun 25, 2010 at 19:08, Miquel Torres <[email protected]> wrote:
>>>>> Hi Paolo,
>>>>>
>>>>> I am aware of the problem with calculating benchmark means, but let
>>>>> me explain my point of view.
>>>>>
>>>>> You are correct in that it would be preferable to have absolute
>>>>> times. Well, you actually can, but see what happens:
>>>>> http://speed.pypy.org/comparison/?hor=true&bas=none&chart=stacked+bars
>>>>
>>>> Ahah! I didn't notice that I could skip normalization! This does not
>>>> fully invalidate my point, however.
>>>>
>>>>> Absolute values would only work if we had carefully chosen benchmark
>>>>> runtimes to be very similar (for our cpython baseline). As it is,
>>>>> html5lib, spitfire and spitfire_cstringio completely dominate the
>>>>> cummulative time.
>>>>
>>>> I acknowledge that (btw, it should be cumulative time, with one 'm',
>>>> both here and in the website).
>>>>
>>>>> And not because the interpreter is faster or slower but because the
>>>>> benchmark was arbitrarily designed to run that long. Any improvement
>>>>> in the long-running benchmarks will carry much more weight than in
>>>>> the short-running ones.
>>>>>
>>>>> What is more useful is to have comparable slices of time so that the
>>>>> improvements can be seen relatively over time.
>>>>
>>>> If you want to sum up times (but at this point, I see no reason for
>>>> it), you should rather have externally derived weights, as suggested
>>>> by the paper (in Rule 3).
>>>> As soon as you take weights from the data, lots of the maths you need
>>>> is not going to work any more - that's generally true in many cases
>>>> in statistics.
>>>> And the only sensible way to have external weights is to gather them
>>>> from real-world programs. Since that's not going to happen easily,
>>>> just stick with the geometric mean. Or set an arbitrarily low weight,
>>>> manually, without any math, so that the long-running benchmarks stop
>>>> dominating the result. It's no fraud, since the current graph is less
>>>> valid anyway.
>>>>
>>>>> Normalizing does that I think.
>>>>
>>>> Not really.
>>>>
>>>>> It just says: we have 21 tasks which each take 1 second to run on
>>>>> interpreter X (cpython in the default case). Then we see how other
>>>>> executables compare to that. What would the geometric mean achieve
>>>>> here, exactly, for the end user?
>>>>
>>>> You actually need the geomean to do that. Don't forget that the
>>>> geomean is still a mean: it's a mean performance ratio which averages
>>>> individual performance ratios.
>>>> If PyPy's geomean is 0.5, it means that PyPy is going to run that
>>>> task in 10.5 seconds instead of 21. To me, this sounds exactly like
>>>> what you want to achieve. Moreover, it actually works, unlike what
>>>> you use.
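A toy check of both points, with invented numbers: the geomean is a
mean performance ratio, and, unlike the arithmetic total behind a
stacked bar, it cannot be flipped by picking a different normalization
baseline.

    import math

    def geomean(xs):
        return math.exp(sum(map(math.log, xs)) / len(xs))

    # A geomean of 0.5 summarizes 21 one-second tasks running, on average,
    # twice as fast: ~10.5 seconds' worth of work instead of 21.
    ratios = [0.25, 0.5, 1.0] * 7       # invented: 21 ratios, geomean 0.5
    print(geomean(ratios), 21 * geomean(ratios))  # 0.5, 10.5

    # The arithmetic total of normalized times flips with the baseline;
    # the geomean of the ratios gives one answer either way.
    cpython = {"bm1": 1.0, "bm2": 10.0}
    pypy_c = {"bm1": 2.0, "bm2": 5.0}
    for name, base in (("cpython", cpython), ("pypy-c", pypy_c)):
        totals = {
            "cpython": sum(cpython[b] / base[b] for b in base),
            "pypy-c": sum(pypy_c[b] / base[b] for b in base),
        }
        print(name, "as baseline:", min(totals, key=totals.get), "looks faster")
    print(geomean([pypy_c[b] / cpython[b] for b in cpython]))  # 1.0: a tie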
>>>>
>>>> For instance, ignore PyPy-JIT, and look only at CPython and pypy-c
>>>> (no JIT). Then, change the normalization between the two:
>>>> http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=2%2B35&chart=stacked+bars
>>>> http://speed.pypy.org/comparison/?exe=2%2B35,3%2BL&ben=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21&env=1&hor=true&bas=3%2BL&chart=stacked+bars
>>>> With the current data, you get that in one case cpython is faster,
>>>> in the other pypy-c is faster.
>>>> It can't happen with the geomean. This is the point of the paper.
>>>>
>>>> I could even construct a normalization baseline $base such that
>>>> CPython seems faster than PyPy-JIT. Such a base should be very fast
>>>> on, say, ai (where PyPy-JIT is slower than CPython), so that
>>>> $cpython.ai/$base.ai becomes 100 and $pypyjit.ai/$base.ai becomes
>>>> 200, and be very slow on other benchmarks (so that they disappear in
>>>> the sum).
>>>>
>>>> So, the only difference I see is that geomean works and the
>>>> arithmetic mean doesn't. That's why Real Benchmarkers use geomean.
>>>>
>>>> Moreover, you are making a mistake quite common among non-physicists.
>>>> What you say makes sense under the implicit assumption that dividing
>>>> two times gives something you can still use as a time. When you say
>>>> "PyPy's runtime for a 1 second task", you actually want to talk about
>>>> a performance ratio, not about the time. In the same way, when you
>>>> say "this bird runs 3 meters in one second", a physicist would sum
>>>> that up as "3 m/s" rather than "3 m".
>>>>
>>>>> I am not really calculating any mean. You can see that I carefully
>>>>> avoided displaying any kind of total bar, which would indeed run
>>>>> into the problem you mention. That a stacked chart implicitly
>>>>> displays a total is something you cannot avoid, and for that kind of
>>>>> chart I still think normalized results are visually the best option.
>>>>
>>>> But on a stacked bars graph, I'm not going to look at individual bars
>>>> at all, just at the total: it's actually less convenient than the
>>>> "normal bars" chart for looking at the result of a particular
>>>> benchmark.
>>>>
>>>> I hope I can find guidelines against stacked plots; I have a PhD
>>>> colleague reading up on how to make graphs.
>>>>
>>>> Best regards
>>>> --
>>>> Paolo Giarrusso - Ph.D. Student
>>>> http://www.informatik.uni-marburg.de/~pgiarrusso/
>>>
>>
>> --
>> Paolo Giarrusso - Ph.D. Student
>> http://www.informatik.uni-marburg.de/~pgiarrusso/
>

--
Paolo Giarrusso - Ph.D. Student
http://www.informatik.uni-marburg.de/~pgiarrusso/
