Things aren't great, but they are actually better than yesterday.

Vital Stats:
  Gate queue length: 107
  Check queue length: 107
  Head of gate entered: 45hrs ago
  Changes merged in last 24hrs: 58

The 58 changes merged is a good number, not a great number, but the best we've seen in a number of days. I saw at least a 6 streak merge yesterday, so zuul is starting to behave like we expect it should.

= Previous Top Bugs =

Our previous top 2 issues - 1270680 and 1270608 (not confusing at all) - are under control.

Bug 1270680 - v3 extensions api inherently racey wrt instances

Russell managed the second part of the fix for this; we've not seen it come back since that was ninja merged.

Bug 1270608 - n-cpu 'iSCSI device not found' log causes gate-tempest-dsvm-*-full to fail

Turning off the test that was triggering this made it completely go away. We'll have to revisit whether that's because there is a cinder bug or a tempest bug, but we'll do that once the dust has settled.

= New Top Bugs =

Note: all fail numbers are across all queues

Bug 1253896 - Attempts to verify guests are running via SSH fails. SSH connection to guest does not work.
  83 fails in 24hrs

Bug 1224001 - test_network_basic_ops fails waiting for network to become available
  51 fails in 24hrs

Bug 1254890 - "Timed out waiting for thing" causes tempest-dsvm-* failures
  30 fails in 24hrs

We are now sorting http://status.openstack.org/elastic-recheck/ by failures in the last 24hrs, so we can use it more as a hit list. The top 3 issues are fingerprinted against infra, but are mostly related to normal restart operations at this point.

= Starvation Update =

With 214 jobs across queues, and averaging 7 devstack nodes per job, our working set is 1498 nodes (i.e. if we had that number we'd be able to run all the jobs right now in parallel). Our current quota of nodes gives us ~480, which is < 1/3 of our working set, and part of the reason for delays.

Rackspace has generously increased our quota in 2 of their availability zones, and Monty is going to prioritize getting those online.
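
For anyone who wants to redo the starvation arithmetic above as the queues change, here is a rough sketch. The function and variable names are made up for illustration; the inputs are the figures quoted in this mail (214 jobs, 7 nodes per job on average, a quota of roughly 480 nodes).

```python
def working_set(jobs: int, nodes_per_job: int) -> int:
    """Nodes needed to run every queued job in parallel."""
    return jobs * nodes_per_job


# Figures quoted above; the quota is approximate (~480).
jobs_in_queues = 214
nodes_per_job = 7
quota = 480

needed = working_set(jobs_in_queues, nodes_per_job)

# Share of the working set that the current quota covers.
coverage = quota / needed

# How many jobs can hold their nodes at the same time under the quota.
concurrent_jobs = quota // nodes_per_job

print(needed)              # 1498
print(round(coverage, 2))  # 0.32
print(concurrent_jobs)     # 68
```

So the quota covers a bit under a third of what the queues could use, which matches the "< 1/3" figure above.
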
Because of Jenkins scaling issues (it starts generating failures when talking to too many build slaves), bringing that extra quota online means spinning up more Jenkins masters. We've found that a 1 / 100 ratio of masters to build slaves makes Jenkins basically stable; pushing beyond that means new fails. Jenkins is not inherently elastic, so this is a somewhat manual process. Monty is diving on that.

There is also a TCP slow start algorithm for zuul that Clark was working on yesterday, which we'll put into production as soon as it is good. This will prevent us from speculating all the way down the gate queue, just to throw it all away on a reset. It acts just like TCP: on every success we grow our speculation length, on every fail we reduce it, with a sane minimum so we don't over-throttle ourselves.

Thanks to everyone that's been pitching in digging on reset bugs. More help is needed. Many core reviewers are at this point completely ignoring normal reviews until the gate is back, so if you are waiting for a review on some code, the best way to get it is to help us fix the bugs resetting the gate.

	-Sean

-- 
Sean Dague
Samsung Research America
s...@dague.net / sean.da...@samsung.com
http://dague.net
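
For reference, the slow start idea described above (grow the speculation length on success, shrink it on failure, never go below a minimum) could be sketched roughly like this. This is not Clark's patch or zuul's actual code: the class name, the starting size of 10, the floor of 3, the +1 growth and the halving on failure are all assumptions picked to make the behaviour visible.

```python
class SpeculationWindow:
    """Hypothetical window limiting how far down the gate queue we speculate."""

    def __init__(self, size=10, floor=3, step=1, shrink=2):
        self.size = size      # current speculation length (assumed start)
        self.floor = floor    # the "sane minimum" (assumed value)
        self.step = step      # growth per success (assumed additive)
        self.shrink = shrink  # divisor applied on failure (assumed halving)

    def on_success(self):
        # A change merged: speculate a little further next time.
        self.size += self.step
        return self.size

    def on_failure(self):
        # A gate reset: back off, but never below the floor.
        self.size = max(self.floor, self.size // self.shrink)
        return self.size


def replay(window, outcomes):
    """Feed a sequence of merge (True) / reset (False) events to the window."""
    return [window.on_success() if merged else window.on_failure()
            for merged in outcomes]


history = replay(SpeculationWindow(), [True, True, False, False, False])
print(history)  # [11, 12, 6, 3, 3]
```

Two merges widen the window from 10 to 12, and three resets in a row bring it down to the floor of 3 and hold it there, so a bad run costs fewer wasted nodes per reset.
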
_______________________________________________ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev