Thanks, Salvatore and Jay, for sharing your experiences on this issue. I
will look through the references you have provided to understand it further
as well. If I latch onto something, I will share it back.
BTW, before posting the question here, I did suspect some race conditions
and tried to play around with the timings of some of the events - nothing
really helped :-(

regards..
-Sukhdev


On Thu, Jan 9, 2014 at 10:38 AM, Salvatore Orlando <[email protected]> wrote:

> Hi Jay,
>
> Replies inline.
> I have probably found one more cause for this issue in the logs, and I
> have added a comment to the bug report.
>
> Salvatore
>
>
> On 9 January 2014 19:10, Jay Pipes <[email protected]> wrote:
>
>> On Thu, 2014-01-09 at 09:09 +0100, Salvatore Orlando wrote:
>> > I am afraid I need to correct you, Jay!
>>
>> I always welcome corrections to things I've gotten wrong, so no
>> worries at all!
>>
>> > This actually appears to be bug 1253896 [1]
>>
>> Ah, the infamous "SSH bug" :) Yeah, so last night I spent a few hours
>> digging through log files and running a variety of e-r queries trying
>> to find some patterns for the bugs that Joe G had sent an ML post
>> about.
>>
>> I went round in circles, unfortunately :( When I thought I'd found a
>> pattern, invariably I would doubt my initial findings and wander into
>> new areas in a wild goose chase.
>
> That's pretty much what I do all the time.
>
>> At various times, I thought something was up with the DHCP agent, as
>> there were lots of "No DHCP Agent found" errors in the q-dhcp screen
>> logs. But I could not correlate any relationship with the failures in
>> the 4 bugs.
>
> I've seen those warnings as well. They are pretty common, and I think
> they are actually benign; since DHCP for the network is configured
> asynchronously, it is probably normal to see that message.
>
>> Then I started thinking that there was a timing/race condition where a
>> security group was being added to the Nova-side servers cache before
>> it had actually been constructed fully on the Neutron side. But I was
>> not able to fully track down the many, many debug messages that are
>> involved in the full sequence of VM launch :( At around 4am, I gave up
>> and went to bed...
>
> I have not investigated how this could impact connectivity. However,
> one thing that is not OK in my opinion is that we have no way to know
> whether a security group is enforced or not; I think it needs an
> 'operational status'.
> Note: we're working on a patch for the nicira plugin to add this
> concept; it's currently being developed as a plugin-specific extension,
> but if there is interest in supporting the concept in the ml2 plugin as
> well, I think we can just make it part of the 'core' security group
> API.
>
>> > Technically, what we call a 'bug' here is actually a failure
>> > manifestation.
>> > So far, we have removed several bugs causing this failure. The last
>> > patch was pushed to devstack around Christmas.
>> > Nevertheless, if you look at recent comments and Joe's email, we
>> > still have a non-negligible failure rate on the gate.
>>
>> Understood. I actually suspect that some of the various performance
>> improvements from Phil Day and others around optimizing certain server
>> and secgroup list calls have made the underlying race conditions show
>> up more often -- since the list calls are completing much faster,
>> which ironically gives Neutron less time to complete setup operations!
>
> That might be one explanation. The other might be the fact that we
> added another scenario test for neutron which creates more VMs with
> floating IPs and such, thus increasing the chances of hitting the
> timeout failure.
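To make the race described above concrete, here is roughly the shape of
the check that ends up losing it. This is only a sketch - the function
name, timeouts, and logic are illustrative, not tempest's actual
implementation of _check_public_network_connectivity:

    # Sketch of a deadline-bounded reachability check, in the spirit of
    # what the scenario test does. All names and timeouts are made up.
    import socket
    import time

    def wait_for_ssh(host, port=22, deadline=60, interval=2):
        # Poll until the SSH port accepts a TCP connection or the
        # deadline passes. If Neutron is still wiring up the port and
        # security groups when the deadline expires, the test fails even
        # though nothing is broken - it simply lost the race.
        start = time.time()
        while time.time() - start < deadline:
            try:
                sock = socket.create_connection((host, port),
                                                timeout=interval)
                sock.close()
                return True
            except socket.error:
                time.sleep(interval)
        return False

The faster the preceding API calls return, the earlier this loop starts
ticking - which would explain why both the Nova-side speedups and the
extra VMs from the new scenario test push more attempts past the deadline.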
>> So, a performance patch on the Nova side ends up putting more pressure
>> on the Neutron side, which causes the rate of occurrence for these
>> sticky bugs (with potentially many root causes) to spike.
>>
>> Such is life, I guess :)
>>
>> > It is also worth mentioning that if you are running your tests with
>> > parallelism enabled (i.e. you're running tempest with tox -esmoke
>> > rather than tox -esmokeserial) you will end up with a higher
>> > occurrence of this failure due to more bugs causing it. These bugs
>> > are due to some weaknesses in the OVS agent that we are addressing
>> > with patches for blueprint neutron-tempest-parallel [2].
>>
>> Interesting. If you wouldn't mind, what makes you think this is a
>> weakness in the OVS agent? I would certainly appreciate your expertise
>> in this area, since it would help me in my own bug-searching
>> endeavors.
>
> Basically, those are all the patches addressing the linked blueprint; I
> have added more info in the commit messages for the patches.
> Also, some of those patches target this bug as well:
> https://bugs.launchpad.net/neutron/+bug/1253993
>
>> All the best,
>> -jay
>>
>> > Regards,
>> > Salvatore
>> >
>> > [1] https://bugs.launchpad.net/neutron/+bug/1253896
>> > [2] https://blueprints.launchpad.net/neutron/+spec/neutron-tempest-parallel
>> >
>> > On 9 January 2014 05:38, Jay Pipes <[email protected]> wrote:
>> >
>> >     On Wed, 2014-01-08 at 18:46 -0800, Sukhdev Kapur wrote:
>> >     > Dear fellow developers,
>> >     >
>> >     > I am running a few Neutron tempest tests and noticing an
>> >     > intermittent failure of
>> >     > tempest.scenario.test_network_basic_ops.
>> >     >
>> >     > I ran this test 50+ times and am getting intermittent
>> >     > failures. The pass rate is approx. 70%. The 30% of the time it
>> >     > fails, it fails mostly in _check_public_network_connectivity.
>> >     >
>> >     > Has anybody seen this?
>> >     > If there is a fix or workaround for this, please share your
>> >     > wisdom.
>> >
>> >     Unfortunately, I believe you are running into this bug:
>> >
>> >     https://bugs.launchpad.net/nova/+bug/1254890
>> >
>> >     The bug is Triaged in Nova (meaning, there is a suggested fix in
>> >     the bug report). It's currently affecting the gate negatively
>> >     and is certainly on the radar of the various PTLs affected.
>> >
>> >     Best,
>> >     -jay
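One more thought on Salvatore's 'operational status' idea: if such a field
were added to security groups, tests could wait for enforcement instead of
guessing. The sketch below is purely hypothetical - the 'status' field and
its ACTIVE value do not exist in the security group API today; only the
neutronclient calls themselves are real:

    # Hypothetical: polls a *proposed* operational status on a security
    # group. The 'status' field and 'ACTIVE' value are assumptions based
    # on the proposal above, not a real Neutron API field.
    import time

    from neutronclient.v2_0 import client as neutron_client

    def wait_for_secgroup_active(neutron, secgroup_id,
                                 deadline=30, interval=1):
        start = time.time()
        while time.time() - start < deadline:
            sg = neutron.show_security_group(
                secgroup_id)['security_group']
            if sg.get('status') == 'ACTIVE':  # hypothetical field
                return True
            time.sleep(interval)
        return False

    neutron = neutron_client.Client(
        username='demo', password='secret', tenant_name='demo',
        auth_url='http://127.0.0.1:5000/v2.0')
    wait_for_secgroup_active(neutron, 'SECGROUP-UUID-HERE')

If the ml2 plugin grew the same concept, test_network_basic_ops could
block on it before the connectivity check and turn this class of race
into a deterministic wait.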
_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
