On Tue, Apr 24, 2018 at 3:17 PM, Ravi Shankar Nori <[email protected]> wrote:
> On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <[email protected]> wrote:
>> Ravi's patch is in, but a similar problem remains, and the test cannot be put back into its place.
>>
>> It seems that while Vdsm was taken down, a couple of getCapsAsync requests queued up. At one point the host resumed its connection, before the requests had been cleared from the queue. After the host is up, the following tests resume, and at a pseudorandom point in time an old getCapsAsync request times out and kills our connection.
>>
>> I believe that as long as ANY request is in flight, the monitoring lock should not be released, and the host should not be declared as up.
>
> Hi Dan,
>
> Can I have the link to the job on jenkins so I can look at the logs

http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/

>> On Wed, Apr 11, 2018 at 1:04 AM, Ravi Shankar Nori <[email protected]> wrote:
>> > This [1] should fix the multiple release lock issue
>> >
>> > [1] https://gerrit.ovirt.org/#/c/90077/
>> >
>> > On Tue, Apr 10, 2018 at 3:53 PM, Ravi Shankar Nori <[email protected]> wrote:
>> >> Working on a patch, will post a fix.
>> >>
>> >> Thanks,
>> >> Ravi
>> >>
>> >> On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan <[email protected]> wrote:
>> >>> Hi all,
>> >>>
>> >>> Looking at the log, it seems that the new GetCapabilitiesAsync is responsible for the mess.
>> >>>
>> >>> - 08:29:47 - The engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.
>> >>>
>> >>> - Every 3 seconds a getCapabilitiesAsync request is sent to the host (unsuccessfully).
>> >>>
>> >>>   * Before each "getCapabilitiesAsync" the monitoring lock is taken (VdsManager.refreshImpl).
>> >>>
>> >>>   * "getCapabilitiesAsync" immediately fails and throws 'VDSNetworkException: java.net.ConnectException: Connection refused'. The exception is caught by 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand', which calls 'onFailure' of the callback and re-throws the exception.
>> >>>
>> >>>       catch (Throwable t) {
>> >>>           getParameters().getCallback().onFailure(t);
>> >>>           throw t;
>> >>>       }
>> >>>
>> >>>   * The 'onFailure' of the callback releases the "monitoringLock" ('postProcessRefresh() -> afterRefreshTreatment() -> if (!succeeded) lockManager.releaseLock(monitoringLock);').
>> >>>
>> >>>   * 'VdsManager.refreshImpl' catches the network exception, marks 'releaseLock = true' and tries to release the already released lock. The following warning is printed to the log -
>> >>>
>> >>>     WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>> >>>
>> >>> - 08:30:51 - A successful getCapabilitiesAsync is sent.
>> >>>
>> >>> - 08:32:55 - The failing test starts (Setup Networks for setting ipv6).
>> >>>
>> >>>   * SetupNetworks takes the monitoring lock.
>> >>>
>> >>> - 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 4 minutes ago from its queue and prints a VDSNetworkException: Vds timeout occurred.
>> >>>
>> >>>   * When the first request is removed from the queue ('ResponseTracker.remove()'), 'Callback.onFailure' is invoked (for the second time) -> the monitoring lock is released (the lock taken by SetupNetworks!).
>> >>>
>> >>>   * The other requests removed from the queue also try to release the monitoring lock, but there is nothing to release.
>> >>>
>> >>>   * The following warning log is printed -
>> >>>
>> >>>     WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>> >>>
>> >>> - 08:33:00 - SetupNetworks fails on a timeout ~4 seconds after it started. Why? I'm not 100% sure, but I guess the root cause is the late processing of the 'getCapabilitiesAsync' requests, which loses the monitoring lock, combined with the late and multiple processing of the failure.
>> >>>
>> >>> Ravi, the 'getCapabilitiesAsync' failure is handled twice, and there are three attempts to release the lock. Please share your opinion regarding how it should be fixed.
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Alona.
>> >>>
>> >>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <[email protected]> wrote:
>> >>>> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <[email protected]> wrote:
>> >>>>> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <[email protected]> wrote:
>> >>>>>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
>> >>>>>> Is it still failing?
>> >>>>>>
>> >>>>>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <[email protected]> wrote:
>> >>>>>>> On 7 April 2018 at 00:30, Dan Kenigsberg <[email protected]> wrote:
>> >>>>>>> > No, I am afraid that we have not managed to understand why setting an ipv6 address took the host off the grid. We shall continue researching this next week.
>> >>>>>>> >
>> >>>>>>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but could it possibly be related (I really doubt that)?
>> >>>>>
>> >>>>> Sorry, but I do not see how this problem is related to VDSM. There is nothing that indicates that there is a VDSM problem.
>> >>>>>
>> >>>>> Has the RPC connection between Engine and VDSM failed?
>> >>>>
>> >>>> Further up the thread, Piotr noticed that (at least on one failure of this test) the Vdsm host lost connectivity to its storage and the Vdsm process was restarted. However, this does not seem to happen in all cases where this test fails.
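
To make the double release easier to see, here is a rough sketch of the kind of guard that addresses the failure path Alona describes above. The class and method names are made up for illustration; this is not the actual engine code. The idea is simply to make onFailure idempotent, so that a late ResponseTracker timeout cannot release a monitoring lock that was already released, or one that meanwhile belongs to SetupNetworks:

    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.function.Consumer;

    // Hypothetical sketch, not the real GetCapabilitiesAsync callback.
    // Only the first failure path (immediate ConnectException, later
    // ResponseTracker timeout, ...) releases the monitoring lock; any
    // further invocation for the same request becomes a no-op.
    final class OneShotFailureCallback {
        private final AtomicBoolean alreadyFailed = new AtomicBoolean(false);
        private final Consumer<Throwable> releaseMonitoringLock;

        OneShotFailureCallback(Consumer<Throwable> releaseMonitoringLock) {
            this.releaseMonitoringLock = releaseMonitoringLock;
        }

        void onFailure(Throwable t) {
            if (alreadyFailed.compareAndSet(false, true)) {
                releaseMonitoringLock.accept(t);
            }
        }
    }

Whether such a guard lives in the callback or in the lock handling is a design choice; the point is that a single refresh attempt should release the monitoring lock at most once, no matter how many times its failure is reported.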
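
Dan's suggestion above, that the monitoring lock should stay held (and the host not be declared Up) while ANY getCapsAsync request is still in flight, would roughly amount to bookkeeping like the following sketch. Again, the class is invented for the example and is not how VdsManager actually tracks refreshes:

    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical sketch of the "hold the lock while anything is in flight"
    // rule. Every getCapsAsync send increments the counter, every completion
    // (success, failure or ResponseTracker timeout) decrements it, and only
    // the transition to zero releases the monitoring lock.
    final class InFlightRequestGuard {
        private final AtomicInteger inFlight = new AtomicInteger(0);
        private final Runnable releaseMonitoringLock;

        InFlightRequestGuard(Runnable releaseMonitoringLock) {
            this.releaseMonitoringLock = releaseMonitoringLock;
        }

        void requestSent() {
            inFlight.incrementAndGet();
        }

        void requestCompleted() {
            if (inFlight.decrementAndGet() == 0) {
                releaseMonitoringLock.run();
            }
        }
    }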

--
Martin Perina
Associate Manager, Software Engineering
Red Hat Czech s.r.o.

_______________________________________________
Devel mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/devel
