On 10/20/2013 02:36 PM, Alex Gaynor wrote:
> There are several issues involved in doing automated regression
> checking for benchmarks:
>
> - You need a platform which is stable. Right now all our CI runs on
>   virtualized instances, and I don't think there's any particular
>   guarantee it'll be the same underlying hardware; furthermore,
>   virtualized systems tend to be very noisy and don't give you the
>   stability you need.
> - You need your benchmarks to be very high precision if you really
>   want to rule out regressions of more than N% without a lot of
>   false positives.
> - You need more than just checks on individual builds; you need
>   long-term trend checking - 100 1% regressions are worse than a
>   single 50% regression.
>
> Alex

Agreed on all these points. However, I think none of them changes where the load generation scripts should be developed.
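(To put rough numbers on Alex's trend point, purely for illustration: small regressions compound multiplicatively, so a long run of 1% hits is far worse than one big one.)

    # Illustrative arithmetic: one hundred compounded 1% regressions
    # versus a single 50% regression.
    compounded = 1.01 ** 100   # ~2.70x the original run time (~170% slower)
    single = 1.50              # 1.5x the original run time (50% slower)
    print(compounded, single)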

They mostly speak to ensuring that we've got a repeatable hardware environment for running the benchmarks, and that we've got the right kind of data collection and analysis to make the results statistically valid.

Point #1 is hard - it really does require bare metal. But let's put that aside for now, as I think there may be bare-metal clouds being made available that could solve it.

But the rest of this is just software. If we had performance metering available in either the core servers or as part of Tempest, we could get appropriate data. Then you'd need a good statistics engine to provide statistically relevant processing of that data: not just line graphs, but real error bars and confidence intervals based on large numbers of runs. I've seen way too many line graphs arguing one point or another about config changes that turn out to have error bars far wider than the effects being claimed. Any system that doesn't expose that isn't really going to be useful.
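To sketch what I mean by real error bars (a minimal example of my own, standard library only; the run times are made up and a large-sample normal approximation is assumed):

    import math
    import statistics

    def confidence_interval(samples, z=1.96):
        """Return (mean, lower, upper) for a ~95% confidence interval."""
        n = len(samples)
        mean = statistics.mean(samples)
        sem = statistics.stdev(samples) / math.sqrt(n)  # standard error
        return mean, mean - z * sem, mean + z * sem

    runs = [42.1, 41.8, 43.0, 42.6, 41.9, 42.3]  # hypothetical run times (s)
    print(confidence_interval(runs))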

Actual performance regressions are going to be *really* hard to find in the gate, just because of the rate of code change that we have, and the variability we've seen on the guests.
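For deciding whether two builds actually differ despite that variability, something like a Welch-style two-sample test is the minimum bar. Here's a rough standard-library sketch of mine (normal approximation, so it assumes large run counts; the data is hypothetical):

    import math
    import statistics

    def regression_p_value(baseline, candidate):
        """Two-sided p-value that the mean run times differ (large n assumed)."""
        m1, m2 = statistics.mean(baseline), statistics.mean(candidate)
        v1 = statistics.variance(baseline) / len(baseline)
        v2 = statistics.variance(candidate) / len(candidate)
        z = (m2 - m1) / math.sqrt(v1 + v2)
        return math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))

    baseline = [42.1, 41.8, 43.0, 42.6]    # hypothetical run times (s)
    candidate = [44.9, 45.3, 44.2, 45.0]
    print(regression_p_value(baseline, candidate))  # tiny p -> likely real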

Honestly, a statistics engine that just took in our existing large sets of data and established baseline variability would be a great step forward (that's new invention; no one has that right now). I'm sure we can figure out a good way to bring the load generation into Tempest to be consistent with our existing validation and scenario tests. The metering could easily be proposed as a nova extension (a la coverage). And that seems to leave you with a setup tool to pull this together in arbitrary environments.
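On baseline variability from existing data, even something as crude as a control chart over historical run times would tell us when we go off the rails (again, just a sketch of mine with made-up numbers):

    import statistics

    def flag_outliers(history, recent, k=3.0):
        """Return entries of `recent` outside baseline mean +/- k * stdev."""
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        return [t for t in recent if abs(t - mean) > k * stdev]

    history = [42.0, 41.7, 42.4, 42.1, 41.9, 42.3, 42.0]  # hypothetical
    print(flag_outliers(history, [42.2, 47.8]))  # only 47.8 gets flagged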

And that's really what I mean about integrating better: whenever possible, figuring out how functionality could be added to existing projects, especially when that means they are enhanced not only for your use case but for use cases those projects have wanted to cover for a while. (Seriously, I'd love to have statistically valid run-time statistics for Tempest that show us when we go off the rails, like we did last week for a few days, and that quantify long-term variability and trends in the stack.) It's harder in the short term, because it means compromises along the way, but the long-term benefit to OpenStack is much greater than another project which duplicates effort from a bunch of existing ones.

        -Sean

--
Sean Dague
http://dague.net
