On 10/20/2013 03:03 PM, Robert Collins wrote:
On 21 October 2013 07:36, Alex Gaynor <alex.gay...@gmail.com> wrote:
There are several issues involved in doing automated regression checking for benchmarks:

- You need a platform which is stable. Right now all our CI runs on virtualized instances, and I don't think there's any particular guarantee it'll be the same underlying hardware; further, virtualized systems tend to be very noisy and don't give you the stability you need.
- You need your benchmarks to be very high precision if you really want to rule out regressions of more than N% without a lot of false positives.
- You need more than just checks on individual builds; you need long-term trend checking: a hundred 1% regressions are worse than a single 50% regression (a rough sketch of such a check is below).
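To make that trend point concrete, here is a minimal sketch of the kind of check meant; the function name, the in-memory list of per-build means, and the thresholds are illustrative assumptions, not an existing tool:

# Minimal sketch (assumptions: results are a time-ordered list of per-build
# benchmark means in seconds; thresholds are arbitrary). It flags both a
# single-build jump and the slow cumulative drift that per-build checks miss.
def check_regressions(means, per_build_pct=5.0, drift_pct=15.0, baseline_n=30):
    alerts = []
    if len(means) >= 2 and means[-2] > 0:
        step = (means[-1] - means[-2]) / means[-2] * 100
        if step > per_build_pct:
            alerts.append("per-build regression: %.1f%%" % step)
    if len(means) > baseline_n:
        baseline = sum(means[:baseline_n]) / baseline_n
        drift = (means[-1] - baseline) / baseline * 100
        if drift > drift_pct:
            alerts.append("drift vs long-term baseline: %.1f%%" % drift)
    return alerts
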
Let me offer a couple more key things:
  - you need a platform that is representative of your deployments: 1000 physical hypervisors have rather different check-in patterns than 1 QEMU hypervisor.
  - you need a workload that is representative of your deployments: 10,000 VMs spread over 500 physical hypervisors routing traffic through one Neutron software switch will have rather different load characteristics than 5 QEMU VMs hosted in a single all-in-one KVM configuration.

Neither the platform (number of components, their configuration, etc.) nor the workload in devstack-gate is representative of production deployments of any except the most modest clouds. That's fine; devstack-gate to date has been about base functionality, not digging down into race conditions.

I think having a dedicated tool aimed at:
  - setting up *many different* production-like environments,
  - running many production-like workloads, and
  - reporting back which ones work and which ones don't

makes a huge amount of sense.

From the reports from that tool we can craft targeted unit tests or isolated functional tests to capture the problem and prevent it from worsening or regressing (once fixed). See, for instance, Joe Gordon's fake hypervisor, which is great for targeted testing.

That said, I also agree with the sentiment expressed that the workload-driving portion of Rally doesn't seem different enough from Tempest to warrant being separate; it seems to me that Rally could be built like this:

- a thing that does deployments spread out over a phase space of configurations
- instrumentation for deployments that permits the data visibility needed to analyse problems
- tests for Tempest that stress a deployment

So the single-button-push Rally would:
  - take a set of hardware
  - in a loop:
    - deploy a configuration, run Tempest, report data

That would reuse Tempest and still be a single-button-push data-gathering thing, and if Tempest isn't capable of generating enough concurrency/load [for a single test - ignore parallel execution of different tests], then that seems like something we should fix in Tempest, because concurrency/race conditions are things we need tests for in devstack-gate.
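
Purely as illustration of that loop (deploy(), run_tempest() and collect_metrics() are assumed helpers here, not existing APIs, and the configuration matrix is made up):

# Illustrative sketch of the single-button loop: deploy each point in a
# phase space of configurations, run Tempest against it, and report the
# result. The helpers and the matrix are assumptions for the example.
import itertools

CONFIG_MATRIX = {
    "hypervisor": ["kvm", "fake"],
    "compute_nodes": [1, 10, 100],
    "network_plugin": ["ovs", "linuxbridge"],
}

def run_matrix(deploy, run_tempest, collect_metrics):
    results = []
    keys = sorted(CONFIG_MATRIX)
    for values in itertools.product(*(CONFIG_MATRIX[k] for k in keys)):
        config = dict(zip(keys, values))
        deploy(config)                  # stand this configuration up
        passed = run_tempest(config)    # functional/stress run against it
        results.append({"config": config,
                        "passed": passed,
                        "metrics": collect_metrics(config)})
    return results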

-Rob

I don't think this is an issue for Tempest. The existing stress tests do exactly that, but without gathering performance data at this point. I think my initial negative reaction to adding the new functionality to Tempest was our old friend, the core-reviewer issue: it is already the case that being a core reviewer in Tempest covers a lot of ground. But on second thought, the Tempest stress stuff would be better off having these capabilities.

Speaking of stress tests, at present the Tempest stress tests run nightly. Should we consider adding half an hour's worth or so in another gate job? Since the stress tests fail if there are errors in the logs, we first have to allow the gate to fail when there are log errors after otherwise successful runs. I have been working on this and should be ready to do it post-summit: https://blueprints.launchpad.net/tempest/+spec/fail-gate-on-log-errors.
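
For the curious, the kind of post-run check that blueprint implies could be as simple as scanning the service logs for error lines after an otherwise green run. A rough sketch, where the log directory and patterns are assumptions rather than the actual gate script:

# Rough sketch of a post-run log check: fail the job if service logs
# contain errors even though the tests passed. Paths/patterns are assumed.
import glob
import re
import sys

ERROR_RE = re.compile(r"\b(ERROR|TRACE)\b")

def scan_logs(log_dir="/opt/stack/logs"):
    hits = []
    for path in glob.glob("%s/*.log" % log_dir):
        with open(path, errors="ignore") as f:
            for lineno, line in enumerate(f, 1):
                if ERROR_RE.search(line):
                    hits.append("%s:%d: %s" % (path, lineno, line.strip()))
    return hits

if __name__ == "__main__":
    hits = scan_logs()
    for hit in hits[:20]:       # print a sample, not the whole flood
        print(hit)
    sys.exit(1 if hits else 0)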

 -David

