I see this as the result of two unrelated issues.

On Mon, May 19, 2014 at 8:53 AM, Matt Riedemann <[email protected]> wrote:

> I was looking through this timeout bug [1] this morning and am able to
> correlate that around the time of the image snapshot timeout, ceilometer
> was really hammering CPU on the host. There are already threads on
> ceilometer performance and how that needs to be improved for Tempest runs
> so I don't want to get into that here.
>
> What I'm thinking about is if there is a way to be smarter about how we do
> timeouts in the tests, rather than just rely on globally configured
> hard-coded timeouts which are bound to fail intermittently in dynamic
> environments like this.
>
> I'm thinking something along the lines of keeping track of CPU stats on
> intervals in our waiter loops, then when we reach our configured timeout,
> calculate the average CPU load/idle and if it falls below some threshold,
> we cut the timeout in half and redo the timeout loop - and we continue that
> until our timeout reaches some level that no longer makes sense, like once
> it drops below a minute, for example.

1. Our test environment is being pushed to its limits. In the past we have
seen things fail in strange ways when CPU idle % drops below 10%. To address
this we can do a few things:

* Better track when our test environment has low idle CPU (post processing
on gate jobs? a rough sketch is at the end of this mail)
* Make gate jobs use less CPU (ceilometer issues, etc.)

> Are there other ideas here? My main concern is the number of random
> timeout failures we see in the tests and then people are trying to
> fingerprint them with elastic-recheck but the queries are so generic they
> are not really useful. We now put the test class and test case in the
> compute test timeout messages, but it's also not very useful to fingerprint
> every individual permutation of test class/case that we can hit a timeout
> in.

2. OpenStack is hard to debug. If we, the developers, cannot figure out what
is failing, then imagine how hard debugging is for non-OpenStack developers.
When we see these types of issues, we should work on making the logs more
useful so we can create better elastic-recheck fingerprints.

> [1] https://bugs.launchpad.net/nova/+bug/1320617
>
> --
>
> Thanks,
>
> Matt Riedemann
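
To make the adaptive-timeout idea quoted above more concrete, here is a very
rough sketch of what such a waiter loop could look like. This is not existing
Tempest code; the function name, the psutil-based idle sampling, and the
10% / one-minute thresholds are purely illustrative:

    import time

    import psutil  # assumed to be available on the test node for idle sampling


    def wait_for_status(get_status, wanted, timeout, interval=1,
                        idle_threshold=10.0, min_timeout=60):
        """Poll get_status() until it returns `wanted` or we give up.

        If the average CPU idle % over the wait was below idle_threshold
        when the timeout expired, halve the timeout and loop again, stopping
        once the budget drops below min_timeout seconds.
        """
        while timeout >= min_timeout:
            idle_samples = []
            start = time.time()
            while time.time() - start < timeout:
                if get_status() == wanted:
                    return True
                # blocks for `interval` seconds and reports CPU idle % over it
                idle_samples.append(psutil.cpu_times_percent(interval).idle)
            avg_idle = (sum(idle_samples) / len(idle_samples)
                        if idle_samples else 100.0)
            if avg_idle >= idle_threshold:
                break  # host was not starved, so treat this as a real timeout
            timeout /= 2.0  # host was starved; retry with a smaller budget
        return False

The nice part is that a starved host buys the resource some extra wall-clock
time without letting the total grow unbounded, while a timeout on an idle
host still fails fast.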
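
And for the low-idle tracking in point 1, a small post-processing pass over
the dstat output a job already captures might be enough to flag starved runs.
The file name and the column holding the idle percentage are assumptions
about the dstat CSV format, so treat this as a sketch only:

    import csv


    def low_idle_seconds(path, threshold=10.0, idle_column=2):
        """Count dstat samples (one per second by default) below threshold idle."""
        low = total = 0
        with open(path) as f:
            for row in csv.reader(f):
                try:
                    idle = float(row[idle_column])
                except (IndexError, ValueError):
                    continue  # skip dstat's header rows and blank lines
                total += 1
                if idle < threshold:
                    low += 1
        return low, total


    if __name__ == '__main__':
        low, total = low_idle_seconds('dstat-csv.log')
        print('%d of %d samples were below 10%% CPU idle' % (low, total))

If we published a one-line summary like that per job run, elastic-recheck
queries (or humans) could at least tell "real" timeouts apart from runs where
the node was simply starved.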
