On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <dan...@redhat.com> wrote:
> Ravi's patch is in, but a similar problem remains, and the test cannot
> be put back into its place.
>
> It seems that while Vdsm was taken down, a couple of getCapsAsync
> requests queued up. At one point, the host resumed its connection,
> before the requests had been cleared from the queue. After the host is
> up, the following tests resume, and at a pseudorandom point in time,
> an old getCapsAsync request times out and kills our connection.
>
> I believe that as long as ANY request is in flight, the monitoring
> lock should not be released, and the host should not be declared as
> up.

Hi Dan,

Can I have the link to the job on jenkins so I can look at the logs?

> On Wed, Apr 11, 2018 at 1:04 AM, Ravi Shankar Nori <rn...@redhat.com> wrote:
> > This [1] should fix the multiple release lock issue
> >
> > [1] https://gerrit.ovirt.org/#/c/90077/
> >
> > On Tue, Apr 10, 2018 at 3:53 PM, Ravi Shankar Nori <rn...@redhat.com> wrote:
> >>
> >> Working on a patch, will post a fix.
> >>
> >> Thanks
> >>
> >> Ravi
> >>
> >> On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan <alkap...@redhat.com> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> Looking at the log it seems that the new GetCapabilitiesAsync is
> >>> responsible for the mess.
> >>>
> >>> - 08:29:47 - engine loses connectivity to host
> >>>   'lago-basic-suite-4-2-host-0'.
> >>>
> >>> - Every 3 seconds a getCapabilitiesAsync request is sent to the host
> >>>   (unsuccessfully).
> >>>
> >>>   * Before each "getCapabilitiesAsync" the monitoring lock is taken
> >>>     (VdsManager.refreshImpl).
> >>>
> >>>   * "getCapabilitiesAsync" immediately fails and throws
> >>>     'VDSNetworkException: java.net.ConnectException: Connection refused'.
> >>>     The exception is caught by
> >>>     'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand', which calls
> >>>     'onFailure' of the callback and re-throws the exception.
> >>>
> >>>       catch (Throwable t) {
> >>>           getParameters().getCallback().onFailure(t);
> >>>           throw t;
> >>>       }
> >>>
> >>>   * The 'onFailure' of the callback releases the "monitoringLock"
> >>>     ('postProcessRefresh() -> afterRefreshTreatment() -> if (!succeeded)
> >>>     lockManager.releaseLock(monitoringLock);').
> >>>
> >>>   * 'VdsManager.refreshImpl' catches the network exception, marks
> >>>     'releaseLock = true' and tries to release the already released lock.
> >>>     The following warning is printed to the log -
> >>>
> >>>     WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
> >>>     (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to
> >>>     release exclusive lock which does not exist, lock key:
> >>>     'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
> >>>
> >>> - 08:30:51 - a successful getCapabilitiesAsync is sent.
> >>>
> >>> - 08:32:55 - The failing test starts (Setup Networks for setting ipv6).
> >>>
> >>>   * SetupNetworks takes the monitoring lock.
> >>>
> >>> - 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests
> >>>   from 4 minutes ago from its queue and prints a VDSNetworkException:
> >>>   Vds timeout occured.
> >>>
> >>>   * When the first request is removed from the queue
> >>>     ('ResponseTracker.remove()'), 'Callback.onFailure' is invoked (for
> >>>     the second time) -> the monitoring lock is released (the lock taken
> >>>     by SetupNetworks!).
> >>>
> >>>   * The other requests removed from the queue also try to release the
> >>>     monitoring lock, but there is nothing to release.
> >>>
> >>>   * The following warning log is printed -
> >>>
> >>>     WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
> >>>     (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to
> >>>     release exclusive lock which does not exist, lock key:
> >>>     'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
> >>>
> >>> - 08:33:00 - SetupNetworks fails on a timeout ~4 seconds after it
> >>>   started. Why? I'm not 100% sure, but I guess the root cause is the
> >>>   late processing of the 'getCapabilitiesAsync' failure, which loses the
> >>>   monitoring lock, combined with the late and multiple handling of that
> >>>   failure.
> >>>
> >>> Ravi, the 'getCapabilitiesAsync' failure is handled twice and there are
> >>> three attempts to release the lock. Please share your opinion regarding
> >>> how it should be fixed.
> >>>
> >>> Thanks,
> >>>
> >>> Alona.
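Just to illustrate the double-release part of Alona's analysis (a rough
sketch only, with made-up names -- SingleReleaseCallback is not an actual
engine class, and the real fix on gerrit may look different): if the
callback owns a single-shot flag, the monitoring lock can be released at
most once, no matter how many times onFailure is invoked:

    import java.util.concurrent.atomic.AtomicBoolean;

    // Hypothetical illustration, not the actual oVirt engine callback.
    public class SingleReleaseCallback {

        private final AtomicBoolean released = new AtomicBoolean(false);
        private final Runnable releaseMonitoringLock;

        public SingleReleaseCallback(Runnable releaseMonitoringLock) {
            this.releaseMonitoringLock = releaseMonitoringLock;
        }

        public void onFailure(Throwable t) {
            // compareAndSet flips the flag exactly once, so the monitoring
            // lock is released at most once, even when onFailure is invoked
            // again later (e.g. by the ResponseTracker timeout cleanup).
            if (released.compareAndSet(false, true)) {
                releaseMonitoringLock.run();
            }
        }
    }

An AtomicBoolean with compareAndSet keeps the guard thread-safe, since the
immediate connection failure and the later ResponseTracker cleanup run on
different threads.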
> >>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <dan...@redhat.com> wrote:
> >>>>
> >>>> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <eh...@redhat.com> wrote:
> >>>>>
> >>>>> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <ee...@redhat.com> wrote:
> >>>>>>
> >>>>>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
> >>>>>> Is it still failing?
> >>>>>>
> >>>>>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkor...@redhat.com> wrote:
> >>>>>>>
> >>>>>>> On 7 April 2018 at 00:30, Dan Kenigsberg <dan...@redhat.com> wrote:
> >>>>>>> > No, I am afraid that we have not managed to understand why setting
> >>>>>>> > an ipv6 address took the host off the grid. We shall continue
> >>>>>>> > researching this next week.
> >>>>>>> >
> >>>>>>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old,
> >>>>>>> > but could it possibly be related (I really doubt that)?
> >>>>>>> >
> >>>>>
> >>>>> Sorry, but I do not see how this problem is related to VDSM.
> >>>>> There is nothing that indicates that there is a VDSM problem.
> >>>>>
> >>>>> Has the RPC connection between Engine and VDSM failed?
> >>>>>
> >>>> Further up the thread, Piotr noticed that (at least on one failure of
> >>>> this test) the Vdsm host lost connectivity to its storage, and the Vdsm
> >>>> process was restarted. However, this does not seem to happen in all
> >>>> cases where this test fails.
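Regarding Dan's point at the top of the thread (not releasing the
monitoring lock or declaring the host Up while ANY request is still in
flight), a minimal sketch of what such a gate could look like -- again with
made-up names (HostRequestTracker is not an existing engine class), so it
only shows the idea, not the actual implementation:

    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical illustration of tracking in-flight requests per host.
    public class HostRequestTracker {

        private final AtomicInteger inFlight = new AtomicInteger(0);

        // Called when a getCapsAsync request is queued for the host.
        public void requestSent() {
            inFlight.incrementAndGet();
        }

        // Called on response, on failure, and on timeout cleanup.
        public void requestCompleted() {
            inFlight.decrementAndGet();
        }

        // Monitoring would consult this before releasing the monitoring
        // lock or moving the host to Up.
        public boolean hasNoPendingRequests() {
            return inFlight.get() == 0;
        }
    }

The counter would have to be decremented on timeouts as well as on
responses, otherwise a single lost response would keep the host from ever
being declared Up.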
_______________________________________________
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel