Joe Gordon has been doing great work tracking test failures and how often they affect us. Since the Havana release the failure rate has increased dramatically, negatively affecting the gate and forcing it to run in a near-worst-case scenario. That is, changes are being tested in parallel, but the head of the queue is more often than not running into a failed job, forcing all changes behind it to be retested, and so on.
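To illustrate why a high failure rate hurts so much, here is a rough toy model of the behaviour described above. It is not Zuul's actual implementation. The function name `simulate_gate`, the per-job `failure_rate`, and the rule that a failing change is simply ejected from the queue are all assumptions made for the sketch.

```python
import random


def simulate_gate(queue_len, failure_rate, seed=0):
    """Return (rounds, total_test_runs) needed to drain a toy gate queue.

    Assumed model (hypothetical, not Zuul's real logic): each round, every
    queued change is tested in parallel. The first failing change, counted
    from the head, is ejected; changes ahead of it merge; everything behind
    it has to be retested in the next round.
    """
    rng = random.Random(seed)
    remaining = queue_len
    rounds = 0
    total_runs = 0
    while remaining > 0:
        rounds += 1
        total_runs += remaining  # one test run per queued change this round
        first_failure = None
        for position in range(remaining):
            if rng.random() < failure_rate:
                first_failure = position
                break
        if first_failure is None:
            remaining = 0  # the whole queue passed and merges
        else:
            # Changes ahead of the failure merge, the failing change is
            # ejected, and the rest are retested next round.
            remaining -= first_failure + 1
    return rounds, total_runs
```

In this model a queue of 130 changes with no failures drains in a single round of 130 test runs, while the worst case (the head fails every round) takes 130 rounds and 130 * 131 / 2 = 8515 test runs.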
This led to a gate queue 130 deep, with the head of the queue 18 hours behind its approval. We have identified fixes for some of the worst current bugs. In order to get them in, we have restarted Zuul, effectively cancelling the gate queue, and have queued these changes up at the front of the queue. Once these changes are in and we are happy with the bug fixing results, we will requeue the changes that were in the queue when it got cancelled.

How do we avoid this in the future? Step one is that reviewers who are approving changes (or reverifying them) should keep an eye on the gate queue. If it is struggling, adding more changes to that queue probably won't help. Instead we should focus on identifying the bugs, submitting changes to elastic-recheck to track these bugs, and working towards fixing the bugs. Everyone is affected by persistent gate failures; we need to work together to fix them.

Thank you for your patience,
Clark

_______________________________________________
OpenStack-dev mailing list
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
