I see this as the result of two unrelated issues.

On Mon, May 19, 2014 at 8:53 AM, Matt Riedemann <[email protected]> wrote:

> I was looking through this timeout bug [1] this morning and am able to
> correlate that around the time of the image snapshot timeout, ceilometer
> was really hammering CPU on the host. There are already threads on
> ceilometer performance and how that needs to be improved for Tempest runs
> so I don't want to get into that here.
>
> What I'm thinking about is if there is a way to be smarter about how we do
> timeouts in the tests, rather than just rely on globally configured
> hard-coded timeouts which are bound to fail intermittently in dynamic
> environments like this.
>
> I'm thinking something along the lines of keeping track of CPU stats on
> intervals in our waiter loops, then when we reach our configured timeout,
> calculate the average CPU load/idle and if it falls below some threshold,
> we cut the timeout in half and redo the timeout loop - and we continue that
> until our timeout reaches some level that no longer makes sense, like once
> it drops below a minute, for example.

1. Our test environment is being pushed to its limits. In the past we have
seen things fail in strange ways when CPU idle % drops below 10%. To address
this we can do a few things:

* Better track when our test environment has low idle CPU (post processing
on gate jobs? a rough sketch is at the end of this mail)
* Make gate jobs use less CPU (ceilometer issues, etc.)

> Are there other ideas here? My main concern is the number of random
> timeout failures we see in the tests and then people are trying to
> fingerprint them with elastic-recheck but the queries are so generic they
> are not really useful. We now put the test class and test case in the
> compute test timeout messages, but it's also not very useful to fingerprint
> every individual permutation of test class/case that we can hit a timeout
> in.

2. OpenStack is hard to debug. If we, the developers, cannot figure out what
is failing, then imagine how hard debugging is for non-OpenStack developers.
When we see these types of issues, we should work on making the logs more
useful so we can create better elastic-recheck fingerprints.

> [1] https://bugs.launchpad.net/nova/+bug/1320617
>
> --
>
> Thanks,
>
> Matt Riedemann
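
To make the adaptive-timeout idea quoted above more concrete, here is a very
rough sketch of what such a waiter loop could look like. This is not existing
Tempest code; the function name, the psutil-based idle sampling, and the
10% / one-minute thresholds are purely illustrative:

    import time

    import psutil  # assumed to be available on the test node for idle sampling


    def wait_for_status(get_status, wanted, timeout, interval=1,
                        idle_threshold=10.0, min_timeout=60):
        """Poll get_status() until it returns `wanted` or we give up.

        If the average CPU idle % over the wait was below idle_threshold
        when the timeout expired, halve the timeout and loop again, stopping
        once the budget drops below min_timeout seconds.
        """
        while timeout >= min_timeout:
            idle_samples = []
            start = time.time()
            while time.time() - start < timeout:
                if get_status() == wanted:
                    return True
                # blocks for `interval` seconds and reports CPU idle % over it
                idle_samples.append(psutil.cpu_times_percent(interval).idle)
            avg_idle = (sum(idle_samples) / len(idle_samples)
                        if idle_samples else 100.0)
            if avg_idle >= idle_threshold:
                break  # host was not starved, so treat this as a real timeout
            timeout /= 2.0  # host was starved; retry with a smaller budget
        return False

The nice part is that a starved host buys the resource some extra wall-clock
time without letting the total grow unbounded, while a timeout on an idle
host still fails fast.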
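
And for the low-idle tracking in point 1, a small post-processing pass over
the dstat output a job already captures might be enough to flag starved runs.
The file name and the column holding the idle percentage are assumptions
about the dstat CSV format, so treat this as a sketch only:

    import csv


    def low_idle_seconds(path, threshold=10.0, idle_column=2):
        """Count dstat samples (one per second by default) below threshold idle."""
        low = total = 0
        with open(path) as f:
            for row in csv.reader(f):
                try:
                    idle = float(row[idle_column])
                except (IndexError, ValueError):
                    continue  # skip dstat's header rows and blank lines
                total += 1
                if idle < threshold:
                    low += 1
        return low, total


    if __name__ == '__main__':
        low, total = low_idle_seconds('dstat-csv.log')
        print('%d of %d samples were below 10%% CPU idle' % (low, total))

If we published a one-line summary like that per job run, elastic-recheck
queries (or humans) could at least tell "real" timeouts apart from runs where
the node was simply starved.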
