This is still a partially formed idea, but one way I've thought of solving 
the timeout issue is by using aggregate data from previous results. Any of 
the event systems (Ceilometer, StackTach) should allow you to build a model 
of what an in-bounds server build time (or the time for any other action) 
per image looks like, and what might be considered an outlier. The general 
idea feels right, but it could also be tricked by slow, steady increases in 
the execution time of an action.
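
A rough sketch of the outlier idea, under some assumptions: the durations 
are taken to have already been pulled out of one of the event systems into 
a plain list of seconds, and the function names, the sample numbers, and the 
median-plus-MAD rule are all illustrative choices, not anything Ceilometer or 
StackTach provides.

```python
import statistics


def outlier_threshold(durations, k=3.0):
    """Return the duration (seconds) above which an action counts as an outlier.

    Uses median + k * scaled MAD so that a few past slow runs do not
    inflate the threshold the way a mean/stddev rule would.
    """
    med = statistics.median(durations)
    # Median absolute deviation, scaled by 1.4826 so it is comparable
    # to a standard deviation for roughly normal data.
    mad = statistics.median([abs(d - med) for d in durations])
    return med + k * 1.4826 * mad


def is_outlier(duration, durations, k=3.0):
    """True if `duration` is beyond the threshold derived from history."""
    return duration > outlier_threshold(durations, k)


# Hypothetical past build times (seconds) for a single image.
history = [41.0, 39.5, 44.2, 40.1, 43.0, 38.7, 42.4]
```

A waiter could then use outlier_threshold(history) as its timeout for that 
image instead of a global value. Note that this does nothing about the slow 
drift problem: if every build gets a little slower, the median moves with it.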

Daryl

-----Original Message-----
From: Matt Riedemann [mailto:[email protected]] 
Sent: Monday, May 19, 2014 10:54 AM
To: OpenStack Development Mailing List (not for usage questions)
Subject: [openstack-dev] [qa] Smarter timeouts in Tempest?

I was looking through this timeout bug [1] this morning and am able to 
correlate that around the time of the image snapshot timeout, ceilometer was 
really hammering CPU on the host.  There are already threads on ceilometer 
performance and how that needs to be improved for Tempest runs so I don't want 
to get into that here.

What I'm thinking about is if there is a way to be smarter about how we do 
timeouts in the tests, rather than just rely on globally configured hard-coded 
timeouts which are bound to fail intermittently in dynamic environments like 
this.

I'm thinking something along the lines of keeping track of CPU stats on 
intervals in our waiter loops, then when we reach our configured timeout, 
calculate the average CPU load/idle and if it falls below some threshold, we 
cut the timeout in half and redo the timeout loop - and we continue that until 
our timeout reaches some level that no longer makes sense, like once it drops 
below a minute for example.
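
A minimal sketch of that loop, with several assumptions: it reads the 
threshold as "average CPU idle below N percent means the host was busy", 
and is_done, sample_cpu_idle, and every default value are hypothetical 
placeholders rather than existing Tempest code or config options.

```python
import time


def wait_with_backoff(is_done, sample_cpu_idle, timeout=300.0, interval=1.0,
                      idle_threshold=20.0, min_timeout=60.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll is_done(); on timeout, retry with half the timeout if the host was busy.

    is_done:         callable returning True once the resource is ready
    sample_cpu_idle: callable returning the current CPU idle percentage
    Returns True if is_done() succeeded, False on a final timeout.
    """
    while timeout >= min_timeout:
        samples = []
        deadline = clock() + timeout
        while clock() < deadline:
            if is_done():
                return True
            samples.append(sample_cpu_idle())
            sleep(interval)
        # Timed out: decide whether the host was too busy to judge fairly.
        avg_idle = sum(samples) / len(samples) if samples else 100.0
        if avg_idle >= idle_threshold:
            return False  # host was not starved, treat as a real timeout
        timeout /= 2  # host was busy, grant a shorter extra round
    return False  # remaining timeout too small to be worth another round
```

With the defaults above, a busy host gets rounds of 300, 150, and 75 seconds 
before the waiter gives up, while an idle host fails after the first 300.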

Are there other ideas here?  My main concern is the number of random timeout 
failures we see in the tests; people then try to fingerprint them with 
elastic-recheck, but the queries are so generic that they are not really 
useful. We now put the test class and test case in the compute test timeout 
messages, but it's also not very useful to fingerprint every individual 
permutation of test class/case that we can hit a timeout in.

[1] https://bugs.launchpad.net/nova/+bug/1320617

-- 

Thanks,

Matt Riedemann


_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
