On Sun, Nov 24, 2013 at 9:58 PM, Robert Collins <robe...@robertcollins.net> wrote:
> I have a proposal - I think we should mark all recheck bugs critical,
> and the respective project PTLs should actively shop around amongst
> their contributors to get them fixed before other work: we should
> drive the known set of nondeterministic issues down to 0 and keep it
> there.

Yes! In fact we are already working towards that. See
http://lists.openstack.org/pipermail/openstack-dev/2013-November/020048.html

> -Rob
>
> On 25 November 2013 18:00, Joe Gordon <joe.gord...@gmail.com> wrote:
> > Hi All,
> >
> > TL;DR: Last week the gate got wedged on nondeterministic failures.
> > Unwedging the gate required drastic action to fix the bugs.
> >
> > Starting on November 15th, gate jobs became progressively less stable,
> > with not enough attention given to fixing the issues, until we got to
> > the point where the gate was almost fully wedged. No single bug caused
> > this; it was a collection of bugs that got us here. The gate protects
> > us from code that fails 100% of the time, but a patch that fails 10%
> > of the time can slip through. Add a few of these bugs together and the
> > gate ends up fully wedged, and fixing it without circumventing the
> > gate (something we never want to do) becomes very hard. It took just
> > two new nondeterministic bugs to take us from a gate that mostly
> > worked to a gate that was almost fully wedged. Last week we found out
> > Jeremy Stanley (fungi) was right when he said, "nondeterministic
> > failures breed more nondeterministic failures, because people are so
> > used to having to reverify their patches to get them to merge that
> > they are doing so even when it's their patch which is introducing a
> > nondeterministic bug."
> >
> > Side note: this is not the first time we wedged the gate. The first
> > time was around September 26th, right when we were cutting the Havana
> > release candidates. In response we wrote elastic-recheck
> > (http://status.openstack.org/elastic-recheck/) to better track which
> > bugs we were seeing.
> >
> > Gate stability according to Graphite:
> > http://paste.openstack.org/show/53765/ (the links are huge because
> > they encode entire queries, so they are included as a pastebin).
> >
> > After sending out an email asking for help fixing the top known gate
> > bugs
> > (http://lists.openstack.org/pipermail/openstack-dev/2013-November/019826.html),
> > we had a few possible fixes. But with the gate wedged, the merge queue
> > was 145 patches long and could take days to be processed; in the worst
> > case, with none of the patches merging, it would take about one hour
> > per patch. So on November 20th we asked for a freeze on any
> > non-critical bug fixes
> > (http://lists.openstack.org/pipermail/openstack-dev/2013-November/019941.html),
> > kicked everything out of the merge queue, and put our possible bug
> > fixes at the front. Even with these drastic measures it still took 26
> > hours to finally unwedge the gate. In those 26 hours we got the check
> > queue failure rate (always higher than the gate failure rate) down
> > from around 87% to below 10%. And we still have many more bugs to
> > track down and fix in order to improve gate stability.
> >
> > Eight major bug fixes later, we have the gate back to a reasonable
> > failure rate. But how did things get so bad? I'm glad you asked; here
> > is a blow-by-blow account.
> >
> > The gate has not been completely stable for a very long time, and it
> > only took two new bugs to wedge it.
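[To put rough numbers on the point above about a few 10% bugs adding up:
with several independent nondeterministic bugs in play, and a deep
dependent gate queue that restarts your tests whenever anything ahead of
you fails, the odds of a clean run collapse quickly. A small illustrative
calculation; all of the counts and rates below are made up for
illustration, not measurements from our gate:

    # Rough illustration of how a handful of independent nondeterministic
    # bugs compound in a gated merge queue. All numbers are hypothetical.

    def run_pass_rate(per_bug_failure_rates):
        """Probability that one integrated test run hits none of the bugs."""
        p = 1.0
        for rate in per_bug_failure_rates:
            p *= (1.0 - rate)
        return p

    # Five unrelated bugs, each tripping roughly 10% of runs:
    single_run = run_pass_rate([0.10] * 5)
    print("single run pass rate: %.2f" % single_run)  # ~0.59

    # In a dependent (gate) queue, a failure ahead of a change restarts its
    # tests, so the chance of a deep queue flushing without any resets is
    # roughly the single-run pass rate raised to the queue depth:
    queue_depth = 20
    print("chance a 20-deep queue merges cleanly: %g" % (single_run ** queue_depth))
]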
> > Starting with the list of bugs we identified via elastic-recheck, we
> > fixed four bugs that had already been in the gate for a few weeks.
> >
> > https://bugs.launchpad.net/bugs/1224001 "test_network_basic_ops fails
> > waiting for network to become available"
> >
> > https://review.openstack.org/57290 was the fix, which depended on
> > https://review.openstack.org/53188 and
> > https://review.openstack.org/57475.
> >
> > This fixed a race condition where the IP address from DHCP was not
> > received by the VM at the right time. Minimize polling on the agent
> > is now defaulted to True, which should consistently reduce the time
> > needed to configure an interface on br-int.
> >
> > https://bugs.launchpad.net/bugs/1252514 "Swift returning errors when
> > setup using devstack"
> >
> > Fix: https://review.openstack.org/#/c/57373/
> >
> > There were a few swift-related problems that were sorted out as well.
> > Most had to do with tuning swift properly for its use as a glance
> > backend in the gate, ensuring that timeout values were appropriate
> > for the devstack test slaves (in resource-constrained environments
> > the default swift timeouts could be tripped frequently; logs showed
> > the requests would have finished successfully given enough time).
> > Swift also had a race condition in how it constructed its sqlite3
> > files for containers and accounts, where it was not retrying
> > operations when the database was locked.
> >
> > https://bugs.launchpad.net/swift/+bug/1243973 "Simultaneous PUT
> > requests for the same account..."
> >
> > Fix: https://review.openstack.org/#/c/57019/
> >
> > This was not on our original list of bugs, but while in bug-fix mode
> > we got this one fixed as well.
> >
> > https://bugs.launchpad.net/bugs/1251784 "nova+neutron scheduling
> > error: Connection to neutron failed: Maximum attempts reached"
> >
> > Fix: https://review.openstack.org/#/c/57509/
> >
> > Uncovered on the mailing list
> > (http://lists.openstack.org/pipermail/openstack-dev/2013-November/019906.html).
> >
> > Nova had a very old version of oslo's local.py, which is used for
> > managing references to local variables in coroutines. The old version
> > had a pretty significant bug that basically meant non-weak references
> > to variables were not managed properly. This fix has made the
> > nova-neutron interactions much more reliable.
> >
> > This fixed the number 2 bug on our list of top gate bugs
> > (http://lists.openstack.org/pipermail/openstack-dev/2013-November/019826.html)!
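[On the swift sqlite3 race above (bug 1243973): the shape of the fix is
to retry the operation when sqlite reports that the database is locked
by another writer, instead of failing the request outright. A minimal,
generic sketch of that pattern; this is illustrative, not the code swift
actually merged:

    import sqlite3
    import time

    def execute_with_retry(conn, sql, params=(), attempts=5, delay=0.1):
        """Retry a statement when another writer holds the sqlite lock.

        Generic illustration of the retry-on-'database is locked' pattern.
        """
        for attempt in range(attempts):
            try:
                return conn.execute(sql, params)
            except sqlite3.OperationalError as e:
                if "locked" not in str(e) or attempt == attempts - 1:
                    raise
                time.sleep(delay * (attempt + 1))  # back off, then retry

    # Two processes racing to update the same container DB back off and
    # retry instead of erroring out immediately. timeout=0 disables
    # sqlite3's built-in busy wait so the retry loop above handles it.
    conn = sqlite3.connect("containers.db", timeout=0)
    conn.execute("CREATE TABLE IF NOT EXISTS objects (name TEXT)")
    execute_with_retry(conn, "INSERT INTO objects VALUES (?)", ("obj1",))
    conn.commit()
]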
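[And on the oslo local.py bug (1251784): local.py provides per-coroutine
storage with both a weak-reference store and a strong-reference store,
and the stale copy in nova did not keep non-weak references properly, so
objects that should have lived for the whole request could be garbage
collected early. A rough sketch of the weak vs. strong distinction, using
plain thread-local storage and weakref instead of the real eventlet/oslo
code; class and variable names here are illustrative, not nova's:

    import threading
    import weakref

    class WeakLocal(threading.local):
        """Per-context storage that only keeps weak references.

        Fine for caches, but anything stored here can vanish as soon as
        the caller drops its own reference. Illustrative only; the real
        oslo code is built on eventlet's coroutine-local storage.
        """
        def __getattribute__(self, attr):
            rval = super(WeakLocal, self).__getattribute__(attr)
            if isinstance(rval, weakref.ref):
                rval = rval()  # dereference; None if already collected
            return rval

        def __setattr__(self, attr, value):
            super(WeakLocal, self).__setattr__(attr, weakref.ref(value))

    # The fix amounted to making sure the "strong" store really holds
    # normal references, so request state survives the whole request.
    weak_store = WeakLocal()
    strong_store = threading.local()

    class RequestContext(object):
        pass

    ctx = RequestContext()
    weak_store.context = ctx
    del ctx
    print(weak_store.context)    # likely None: only a weak ref was kept

    strong_store.context = RequestContext()
    print(strong_store.context)  # still alive: the store keeps it referenced
]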
> > In addition to fixing four old bugs, we fixed two new bugs that were
> > introduced / exposed this week.
> >
> > https://bugs.launchpad.net/bugs/1251920 "Tempest failures due to
> > failure to return console logs from an instance"
> >
> > Introduced by: https://review.openstack.org/#/c/54363/ [Tempest]
> >
> > Fix (workaround): https://review.openstack.org/#/c/57193/
> >
> > After many false starts and banging our heads against the wall, we
> > identified a change to tempest, https://review.openstack.org/54363,
> > that added a new test around the same time bug 1251920 became a
> > problem. Forcing tempest to skip this test had a very high incidence
> > of success without any 1251920-related failures. As a result we are
> > working around this bug by skipping that test until it can be run
> > without major impact to the gate.
> >
> > The change that introduced this problematic test had to go through
> > the gate four times before it would merge, though only one of the
> > three failed attempts appears to have triggered 1251920. Or, as
> > Jeremy Stanley (fungi) said, "nondeterministic failures breed more
> > nondeterministic failures, because people are so used to having to
> > reverify their patches to get them to merge that they are doing so
> > even when it's their patch which is introducing a nondeterministic
> > bug."
> >
> > https://bugs.launchpad.net/bugs/1252170 "tempest.scenario
> > test_resize_server_confirm failed in grenade"
> >
> > Fix: https://review.openstack.org/#/c/57357/
> >
> > Fix: https://review.openstack.org/#/c/57572/
> >
> > First, we started running the post-Grenade upgrade tests in parallel
> > (to fix another bug), which would normally be fine, but Grenade
> > wasn't configuring the small flavors typically used by tempest, so
> > the devstack Jenkins slaves could run out of memory when starting
> > many larger VMs in parallel. To fix this, devstack's lib/tempest has
> > been updated to create the flavors only if they don't already exist,
> > and Grenade now allows tempest to use its default instance flavors.
> >
> > Now that we have the gate back in working order, we are working on
> > the next steps to prevent this from happening again. The two most
> > immediate changes are:
> >
> > Doing a better job of triaging gate bugs
> > (http://lists.openstack.org/pipermail/openstack-dev/2013-November/020048.html).
> >
> > In the next few days we will remove 'reverify no bug' (although you
> > will still be able to run 'reverify bug x').
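[As an aside on the Grenade flavor fix above: the actual devstack change
is shell in lib/tempest, but the create-only-if-missing idea it
implements looks roughly like this in Python. Hypothetical sketch using
python-novaclient; the flavor names and sizes are examples, not the
exact devstack values:

    # Hypothetical sketch of devstack's create-if-missing flavor setup,
    # expressed with python-novaclient (the real fix is shell in lib/tempest).

    def ensure_flavor(nova, name, ram_mb, vcpus, disk_gb, flavor_id):
        """Create a flavor unless one with this name is already registered."""
        if any(f.name == name for f in nova.flavors.list()):
            return  # already set up (e.g. by the old-side devstack in grenade)
        nova.flavors.create(name, ram_mb, vcpus, disk_gb, flavorid=flavor_id)

    # Example usage with an authenticated novaclient instance, creating the
    # kind of tiny flavors tempest runs with so parallel boots stay small:
    # ensure_flavor(nova, "m1.nano", 64, 1, 0, 42)
    # ensure_flavor(nova, "m1.micro", 128, 1, 0, 84)
]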
> > Best,
> > Joe Gordon
> > Clark Boylan
>
> --
> Robert Collins <rbtcoll...@hp.com>
> Distinguished Technologist
> HP Converged Cloud

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev