On Wednesday, November 20, 2013 2:44:52 PM, Clark Boylan wrote:
Joe Gordon has been doing great work tracking test failures and how
often they affect us. Since the Havana release the failure rate has
increased dramatically, negatively affecting the gate and forcing it to
run in a near-worst-case scenario. That is, changes are being tested in
parallel, but the head of the queue is more often than not running into
a failed job, forcing all changes behind it to be retested, and so on.
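
To make the cost of that scenario concrete, here is a minimal sketch of
the retest cascade. It is a toy model, not Zuul's actual implementation:
the function name simulate_gate and the (name, passes) tuples are
hypothetical, and it assumes each change deterministically passes or
fails, whereas real gate failures are often intermittent.

```python
def simulate_gate(changes):
    """Toy model of a gate queue tested in parallel.

    `changes` is a list of (name, passes) tuples in queue order.
    Assumption: each round tests every queued change once. The first
    failing change is ejected and everything behind it is retested.
    """
    queue = list(changes)
    merged, ejected = [], []
    rounds = 0
    test_runs = 0
    while queue:
        rounds += 1
        test_runs += len(queue)  # one test run per queued change
        # Index of the first failing change in the queue, if any.
        first_fail = next(
            (i for i, (_, ok) in enumerate(queue) if not ok), None
        )
        if first_fail is None:
            # Nothing failed: the whole queue merges.
            merged.extend(name for name, _ in queue)
            queue = []
        else:
            # Changes ahead of the failure merge, the failure is
            # ejected, and everything behind it must be retested.
            merged.extend(name for name, _ in queue[:first_fail])
            ejected.append(queue[first_fail][0])
            queue = queue[first_fail + 1:]
    return {
        "merged": merged,
        "ejected": ejected,
        "rounds": rounds,
        "test_runs": test_runs,
    }


healthy = simulate_gate([("A", True), ("B", True), ("C", True), ("D", True)])
head_fails = simulate_gate([("A", False), ("B", True), ("C", True), ("D", True)])
worst_case = simulate_gate([("A", False), ("B", False), ("C", False), ("D", False)])
```

In this model a healthy queue of four changes needs 4 test runs, a
single failure at the head raises that to 7, and a failure at every
head costs 10. The wasted work grows with queue depth, which is why a
deep queue with a frequently failing head drains so slowly.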

This led to a gate queue 130 deep with the head of the queue 18 hours
behind its approval. We have identified fixes for some of the worst
current bugs and, in order to get them in, have restarted Zuul
(effectively cancelling the gate queue) and queued these changes up at
the front of the queue. Once these changes are in and we are happy with
the bug-fixing results, we will requeue the changes that were in the
queue when it got cancelled.

How do we avoid this in the future? Step one is that reviewers who are
approving changes (or reverifying them) should keep an eye on the gate
queue. If it is struggling, adding more changes to that queue probably
won't help. Instead we should focus on identifying the bugs, submitting
changes to elastic-recheck to track these bugs, and working towards
fixing the bugs. Everyone is affected by persistent gate failures, so
we need to work together to fix them.

Thank you for your patience,

Clark

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Let me also say that I think it's really helpful that Joe has been sending out recaps to the mailing list about the top offenders, so people can pitch in on investigating and fixing them (as we saw with the Neutron team's response to Joe's recent post about the top gate failures).

People get heads-down in their own projects and what they are working on, and it's hard to keep up with what's going on in the infra channel (or the nova channel, for that matter). Sending out a recap that everyone can see on the mailing list helps reset where things stand and brings focus to what may be several isolated investigations (as we saw happen this week).

--

Thanks,

Matt Riedemann

