On Tue, Apr 24, 2018 at 9:24 AM, Martin Perina <[email protected]> wrote:
>
> On Tue, Apr 24, 2018 at 3:17 PM, Ravi Shankar Nori <[email protected]> wrote:
>
>> On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <[email protected]> wrote:
>>
>>> Ravi's patch is in, but a similar problem remains, and the test cannot be put back into its place.
>>>
>>> It seems that while Vdsm was taken down, a couple of getCapsAsync requests queued up. At one point, the host resumed its connection, before the requests had been cleared from the queue. After the host is up, the following tests resume, and at a pseudorandom point in time, an old getCapsAsync request times out and kills our connection.
>>>
>>> I believe that as long as ANY request is in flight, the monitoring lock should not be released, and the host should not be declared as up.
>>
>> Hi Dan,
>>
>> Can I have the link to the job on jenkins so I can look at the logs?
>
> http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/

From the logs, the only VDS lock that is being released twice is the VDS_FENCE lock. Opened a BZ [1] for it. Will post a fix.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1571300

>>> On Wed, Apr 11, 2018 at 1:04 AM, Ravi Shankar Nori <[email protected]> wrote:
>>> > This [1] should fix the multiple release lock issue
>>> >
>>> > [1] https://gerrit.ovirt.org/#/c/90077/
>>> >
>>> > On Tue, Apr 10, 2018 at 3:53 PM, Ravi Shankar Nori <[email protected]> wrote:
>>> >> Working on a patch, will post a fix.
>>> >>
>>> >> Thanks,
>>> >> Ravi
>>> >>
>>> >> On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan <[email protected]> wrote:
>>> >>> Hi all,
>>> >>>
>>> >>> Looking at the log it seems that the new GetCapabilitiesAsync is responsible for the mess.
>>> >>>
>>> >>> - 08:29:47 - engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.
>>> >>>
>>> >>> - Every 3 seconds a getCapabilitiesAsync request is sent to the host (unsuccessfully).
>>> >>>
>>> >>>   * Before each "getCapabilitiesAsync" the monitoring lock is taken (VdsManager.refreshImpl).
>>> >>>
>>> >>>   * "getCapabilitiesAsync" immediately fails and throws 'VDSNetworkException: java.net.ConnectException: Connection refused'. The exception is caught by 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand', which calls 'onFailure' of the callback and re-throws the exception.
>>> >>>
>>> >>>       catch (Throwable t) {
>>> >>>           getParameters().getCallback().onFailure(t);
>>> >>>           throw t;
>>> >>>       }
>>> >>>
>>> >>>   * The 'onFailure' of the callback releases the "monitoringLock" ('postProcessRefresh() -> afterRefreshTreatment() -> if (!succeeded) lockManager.releaseLock(monitoringLock);').
>>> >>>
>>> >>>   * 'VdsManager.refreshImpl' catches the network exception, marks 'releaseLock = true' and tries to release the already released lock. The following warning is printed to the log:
>>> >>>
>>> >>>     WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>> >>>
>>> >>> - 08:30:51 - a successful getCapabilitiesAsync is sent.
>>> >>>
>>> >>> - 08:32:55 - The failing test starts (Setup Networks for setting ipv6).
>>> >>>
>>> >>>   * SetupNetworks takes the monitoring lock.
>>> >>>
>>> >>> - 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 4 minutes ago out of its queue and prints a VDSNetworkException: Vds timeout occured.
>>> >>>
>>> >>>   * When the first request is removed from the queue ('ResponseTracker.remove()'), 'Callback.onFailure' is invoked (for the second time) -> the monitoring lock is released (the lock taken by SetupNetworks!).
>>> >>>
>>> >>>   * The other requests removed from the queue also try to release the monitoring lock, but there is nothing to release.
>>> >>>
>>> >>>   * The following warning is printed to the log:
>>> >>>
>>> >>>     WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>> >>>
>>> >>> - 08:33:00 - SetupNetworks fails on a timeout ~4 seconds after it started. Why? I'm not 100% sure, but I guess the late processing of 'getCapabilitiesAsync', which causes the loss of the monitoring lock, together with the late and multiple processing of the failure, is the root cause.
>>> >>>
>>> >>> Ravi, the 'getCapabilitiesAsync' failure is handled twice and the lock release is attempted three times. Please share your opinion on how it should be fixed.
>>> >>>
>>> >>> Thanks,
>>> >>> Alona.
>>> >>>
>>> >>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <[email protected]> wrote:
>>> >>>> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <[email protected]> wrote:
>>> >>>>> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <[email protected]> wrote:
>>> >>>>>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
>>> >>>>>> Is it still failing?
>>> >>>>>>
>>> >>>>>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <[email protected]> wrote:
>>> >>>>>>> On 7 April 2018 at 00:30, Dan Kenigsberg <[email protected]> wrote:
>>> >>>>>>> > No, I am afraid that we have not managed to understand why setting an ipv6 address took the host off the grid. We shall continue researching this next week.
>>> >>>>>>> >
>>> >>>>>>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but could it possibly be related (I really doubt that)?
>>> >>>>>
>>> >>>>> Sorry, but I do not see how this problem is related to VDSM. There is nothing that indicates that there is a VDSM problem.
>>> >>>>>
>>> >>>>> Has the RPC connection between Engine and VDSM failed?
>>> >>>>
>>> >>>> Further up the thread, Piotr noticed (at least on one failure of this test) that the Vdsm host lost connectivity to its storage and the Vdsm process was restarted. However, this does not seem to happen in all cases where this test fails.
>
> --
> Martin Perina
> Associate Manager, Software Engineering
> Red Hat Czech s.r.o.
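To make the double "onFailure" processing described by Alona harmless, the failure callback can be made idempotent, so that it fires at most once no matter how many code paths (the command's catch block, ResponseTracker's timeout sweep) report the same failure. Below is a minimal, self-contained sketch of that idea only; every class and method name is hypothetical, and this is not the actual ovirt-engine code nor the fix posted on gerrit.

    import java.util.concurrent.atomic.AtomicBoolean;

    public class OnceOnlyFailureCallbackSketch {

        /** Stand-in for the engine's async VDS command callback (hypothetical). */
        interface AsyncCallback {
            void onFailure(Throwable t);
        }

        /** Decorator that guarantees onFailure is delivered at most once. */
        static final class OnceOnlyCallback implements AsyncCallback {
            private final AsyncCallback delegate;
            private final AtomicBoolean failureDelivered = new AtomicBoolean(false);

            OnceOnlyCallback(AsyncCallback delegate) {
                this.delegate = delegate;
            }

            @Override
            public void onFailure(Throwable t) {
                // Only the first reporter wins; later reporters (e.g. a timeout
                // sweep that processes the same request again) become no-ops.
                if (failureDelivered.compareAndSet(false, true)) {
                    delegate.onFailure(t);
                }
            }
        }

        public static void main(String[] args) {
            AsyncCallback releaseMonitoringLock = t ->
                    System.out.println("releasing monitoring lock because: " + t.getMessage());

            AsyncCallback callback = new OnceOnlyCallback(releaseMonitoringLock);

            // First failure path: the command itself fails (connection refused).
            callback.onFailure(new RuntimeException("Connection refused"));
            // Second failure path: the response tracker times the request out later.
            callback.onFailure(new RuntimeException("Vds timeout"));
        }
    }

Running it prints the release message once; the second report of the same failure becomes a no-op, so a lock taken later by an unrelated flow such as SetupNetworks cannot be released by mistake.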
_______________________________________________
Devel mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/devel
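Dan's point earlier in the thread — that the monitoring lock should stay held and the host should not be declared up while ANY getCapsAsync request is still in flight — could be expressed roughly as in the sketch below. Again this is only an assumption-laden illustration: every name is made up, and a binary Semaphore stands in for the engine's in-memory lock manager, which, unlike a ReentrantLock, allows the lock to be released from a thread other than the one that acquired it.

    import java.util.concurrent.Semaphore;
    import java.util.concurrent.atomic.AtomicInteger;

    public class InFlightGatedMonitoringSketch {

        // Binary semaphore modelling the monitoring lock: releasable from any thread.
        private final Semaphore monitoringLock = new Semaphore(1);
        // Requests sent but not yet answered, failed or timed out.
        private final AtomicInteger inFlight = new AtomicInteger();
        private volatile boolean hostUp;

        /** Take the monitoring lock when the first outstanding request is sent. */
        void beforeGetCapsAsync() throws InterruptedException {
            if (inFlight.getAndIncrement() == 0) {
                monitoringLock.acquire();
            }
        }

        /** Must be called exactly once per request: on success, failure or timeout. */
        void afterGetCapsAsync(boolean succeeded) {
            if (succeeded) {
                hostUp = true;
            }
            // Only the completion of the LAST outstanding request releases the lock
            // and lets the host status become visible as Up.
            if (inFlight.decrementAndGet() == 0) {
                monitoringLock.release();
                System.out.println("monitoring lock released, host up = " + hostUp);
            }
        }

        public static void main(String[] args) throws InterruptedException {
            InFlightGatedMonitoringSketch monitor = new InFlightGatedMonitoringSketch();
            // Two requests queue up while Vdsm is down...
            monitor.beforeGetCapsAsync();
            monitor.beforeGetCapsAsync();
            // ...one succeeds once the host is back, but the lock is released only
            // after the stale second request has also been accounted for.
            monitor.afterGetCapsAsync(true);
            monitor.afterGetCapsAsync(false);
        }
    }

This single-threaded demo ignores the races a real implementation would have to handle; it is only meant to show the gating invariant, so a stale request that times out later can no longer steal a lock held by another flow.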
