Hey. I'll answer the questions that are relevant to the benchmarks themselves rather than to running the site.
On Wed, Mar 10, 2010 at 4:45 PM, Bengt Richter <[email protected]> wrote:
> On 03/10/2010 12:14 PM Miquel Torres wrote:
>> Hi!
>>
>> I wanted to explain a couple of things about the speed website:
>>
>> - New feature: the Timeline view now defaults to a plot grid, showing
>> all benchmarks at the same time. It was a feature request made more
>> than once, so depending on personal tastes, you can bookmark either
>> /overview/ or /timeline/. Thanks go to nsf for helping with the
>> implementation.
>> - The code has now moved to github as Codespeed, a benchmark
>> visualization framework (http://github.com/tobami/codespeed)
>> - I have updated speed.pypy.org with version 0.3. Much of the work has
>> been under the hood to make it feasible for other projects to use
>> codespeed as a framework.
>>
>> For those interested in further development you can go to the releases
>> wiki (still a work in progress):
>> http://wiki.github.com/tobami/codespeed/releases
>>
>> Next in line are some DB changes to be able to save standard
>> deviation data and the like. Long-term goals besides world domination
>> are integration with buildbot and similarly unrealistic things.
>> Feedback is always welcome.
>
> Nice looking stuff. But a couple comments:
>
> 1. IMO standard deviation is too often worse than useless, since it hides
> the true nature of the distribution. I think the assumption of normalcy
> is highly suspect for benchmark timings, and pruning may hide interesting
> clusters.
>
> I prefer to look at scattergrams, where things like clustering and
> correlations are easily apparent to the eye, as well as the amount of data
> (assuming a good mapping of density to visuals).

That's true. In general a benchmark run consists of a warmup period, when
the JIT compiles assembler, followed by a steady state that can be
described by an average and a standard deviation. Personally I would like
to have those three measures separated, but I haven't implemented that yet
(there are also some interesting statistical questions involved). The
standard deviation is useful for telling whether the difference caused by
a certain checkin was meaningful or just noise.

> 2. IMO benchmark timings are like travel times, comparing different vehicles.
> (pypy with jit being a vehicle capable of dynamic self-modification ;-)
> E.g., which part of travel from Stockholm to Paris would you concentrate
> on improving to improve the overall result? How about travel from Brussels
> to Paris? Or Paris to Sydney? ;-P Different things come into play in
> different benchmarks/trips. A Porsche Turbo and a 2CV will both have to
> wait for a ferry, if that's part of the trip.
>
> IOW, it would be nice to see total time broken down somehow, to see what's
> really happening.

I couldn't agree more with that. We already split the time when we run
benchmarks by hand, but those split timings are not yet integrated into
the nightly run. Total time is what users see, though, which is why our
public site focuses on it. I want more information to be available, but we
have only a limited amount of manpower, and Miquel has already done quite
an amazing job in my opinion :-) We'll probably go into more detail. The
part we want to focus on after the release is speeding up certain parts of
tracing as well as limiting its GC pressure. As you can see, the split
would be very useful for our own development.
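To illustrate the kind of split and noise check I mean, here is a very
rough sketch (the names, the fixed cutoff of 5 warmup iterations and the
two-sigma noise test are all made up for illustration - this is not code
from our benchmark runner):

import math

def summarize(times, warmup=5):
    # Split raw per-iteration timings into a warmup phase (while the JIT
    # is compiling assembler) and a steady state described by a mean and
    # a standard deviation. Assumes more than `warmup` iterations.
    warmup_time = sum(times[:warmup])
    steady = times[warmup:]
    mean = sum(steady) / len(steady)
    var = sum((t - mean) ** 2 for t in steady) / max(len(steady) - 1, 1)
    return warmup_time, mean, math.sqrt(var)

def looks_like_noise(old_times, new_times):
    # Crude check: is the difference between two revisions within the
    # combined noise (roughly two standard deviations)?
    _, old_mean, old_std = summarize(old_times)
    _, new_mean, new_std = summarize(new_times)
    return abs(new_mean - old_mean) < 2 * (old_std + new_std)

The same raw timings could also feed a scattergram, in addition to the
summary numbers.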
> Don't get me wrong, the total times are certainly useful indicators of
> progress (which has been amazing).
>
> 3. Speed is ds/dt and you are showing the integral of dt/ds over the trip
> distance to get time. A 25% improvement in total time is not a 25%
> improvement in speed. I.e. (if you define improvement as a percentage
> change in a desired direction), for e.g. 25%:
> distance/(0.75*time) != 1.25*(distance/time).
>
> IMO 'speed' (the implication to me in the name speed.pypy.org) would be
> benchmarks/time more appropriately than time/benchmark.
>
> Both measures are useful, but time percentages are easy to
> mis{use,construe} ;-)

That's correct. Benchmarks are in general very easy to lie about, and they
are by definition flawed. That's why I always include the raw data when I
publish stuff on the blog, so people can work with it themselves.
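To make the arithmetic concrete (purely made-up numbers, just for
illustration):

# Made-up numbers: a benchmark whose total time drops from 2.0s to 1.5s.
old_time, new_time = 2.0, 1.5
time_saved = 1 - new_time / old_time    # 0.25    -> "25% less time"
speedup = old_time / new_time - 1       # 0.33... -> 33% more benchmarks per second

So cutting the time by 25% is a 33% gain in speed, and the gap only grows
with bigger improvements (halving the time is a 100% speedup).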
> 4. Is there any memory footprint data?

No. Memory measurement is hard, and it's even less useful without a
breakdown. Those particular benchmarks are not a very good basis for
memory measurement - in the case of PyPy you would mostly observe the
default allocated memory (roughly 10M for the interpreter + 16M for the
semispace GC + the cache for the nursery). Also, our GC is of a kind that
can run faster if you give it more memory (not that we use this feature,
but it's possible).

Cheers,
fijal

_______________________________________________
[email protected]
http://codespeak.net/mailman/listinfo/pypy-dev