> On 12 Apr 2018, at 17:15, Dafna Ron <[email protected]> wrote:
>
> hi,
>
> we are failing randomly on test 006_migrations.migrate_vm with what seems to
> be the same issue.
>
> the vm seems to be migrated successfully, but the engine thinks that it failed
> and re-calls migration, getting a response that the vm already exists.
>
> I don't think this is an issue with the test but rather a regression, so I
> opened a bug:

I do not think so. I've heard someone removed a test in between migrating A->B
and B->A? If that's the case, that is the real issue. You can't migrate back to
A without waiting for A to be cleared out properly.
https://gerrit.ovirt.org/#/c/90166/ should fix it
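For illustration only, a minimal test-side sketch of the "wait for A to be
cleared out" guard described above. This is not the actual OST/Lago code; the
vmGoneFromSource predicate is a hypothetical stand-in for whatever API the
suite uses to ask the source host whether it still reports the VM.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.function.BooleanSupplier;

    public final class MigrateBackGuard {
        // Poll until the source host no longer reports the VM, or give up
        // after the timeout; only then is it safe to migrate back.
        public static void waitUntilVmGone(BooleanSupplier vmGoneFromSource, Duration timeout)
                throws InterruptedException {
            Instant deadline = Instant.now().plus(timeout);
            while (!vmGoneFromSource.getAsBoolean()) {
                if (Instant.now().isAfter(deadline)) {
                    throw new IllegalStateException("source host still reports the VM after " + timeout);
                }
                Thread.sleep(1_000); // poll once per second
            }
        }
    }

Usage would be along the lines of
MigrateBackGuard.waitUntilVmGone(() -> !sourceHostReportsVm(vmId), Duration.ofMinutes(2));
where sourceHostReportsVm is whatever check the suite already has for the
source host's VM list.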
> https://bugzilla.redhat.com/show_bug.cgi?id=1566594
>
> Thanks,
> Dafna
>
>
> On Wed, Apr 11, 2018 at 1:52 PM, Milan Zamazal <[email protected]> wrote:
> Arik Hadas <[email protected]> writes:
>
> > On Wed, Apr 11, 2018 at 12:45 PM, Alona Kaplan <[email protected]> wrote:
> >
> >> On Tue, Apr 10, 2018 at 6:52 PM, Gal Ben Haim <[email protected]> wrote:
> >>
> >>> I'm seeing the same error in [1], during 006_migrations.migrate_vm.
> >>>
> >>> [1] http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1650/
> >>>
> >>
> >> Seems like another bug. The migration failed since, for some reason, the vm
> >> is already defined on the destination host.
> >>
> >> 2018-04-10 11:08:08,685-0400 ERROR (jsonrpc/0) [api] FINISH create
> >> error=Virtual machine already exists (api:129)
> >> Traceback (most recent call last):
> >>   File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
> >>     ret = func(*args, **kwargs)
> >>   File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 191, in create
> >>     raise exception.VMExists()
> >> VMExists: Virtual machine already exists
> >>
> > Milan, Francesco, could it be that because of [1], which appears on the
> > destination host right after shutting down the VM, it remained defined on
> > that host?
>
> I can't see any destroy call in the logs after the successful preceding
> migration from the given host. That would explain the "VMExists" error.
>
> > [1] 2018-04-10 11:01:40,005-0400 ERROR (libvirt/events) [vds] Error running
> > VM callback (clientIF:683)
> >
> > Traceback (most recent call last):
> >   File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 646, in dispatchLibvirtEvents
> >     v.onLibvirtLifecycleEvent(event, detail, None)
> > AttributeError: 'NoneType' object has no attribute 'onLibvirtLifecycleEvent'
>
> That means that a life cycle event on an unknown VM has arrived, in this
> case apparently a destroy event, following the destroy call after the
> failed incoming migration. The reported AttributeError is a minor bug,
> already fixed in master, so it's most likely unrelated to the discussed
> problem.
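To make the mechanism Milan describes concrete, here is a rough,
language-agnostic sketch only (vdsm itself is Python and these names are
hypothetical, not its real API): the destination keeps a table of defined VMs,
so if the destroy after the earlier migration never happens, the stale entry
is still there and the next incoming create fails with "already exists".

    import java.util.concurrent.ConcurrentHashMap;

    final class VmRegistrySketch {
        static final class VMExistsException extends RuntimeException {}

        private final ConcurrentHashMap<String, Object> vms = new ConcurrentHashMap<>();

        void create(String vmId) {
            // An incoming migration defines the VM; refuse if it is still registered.
            if (vms.putIfAbsent(vmId, new Object()) != null) {
                throw new VMExistsException();
            }
        }

        void destroy(String vmId) {
            // Without this call after migrating the VM away, the entry stays
            // behind and the next create() for the same id fails.
            vms.remove(vmId);
        }
    }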
> >>> On Tue, Apr 10, 2018 at 4:14 PM, Alona Kaplan <[email protected]> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> Looking at the log, it seems that the new GetCapabilitiesAsync is
> >>>> responsible for the mess.
> >>>>
> >>>> - 08:29:47 - the engine loses connectivity to host
> >>>>   'lago-basic-suite-4-2-host-0'.
> >>>>
> >>>> - Every 3 seconds a getCapabilitiesAsync request is sent to the host
> >>>>   (unsuccessfully).
> >>>>
> >>>>   * Before each "getCapabilitiesAsync" the monitoring lock is taken
> >>>>     (VdsManager.refreshImpl).
> >>>>
> >>>>   * "getCapabilitiesAsync" immediately fails and throws
> >>>>     'VDSNetworkException: java.net.ConnectException: Connection refused'.
> >>>>     The exception is caught by
> >>>>     'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand', which calls
> >>>>     'onFailure' of the callback and re-throws the exception:
> >>>>
> >>>>         catch (Throwable t) {
> >>>>             getParameters().getCallback().onFailure(t);
> >>>>             throw t;
> >>>>         }
> >>>>
> >>>>   * The 'onFailure' of the callback releases the "monitoringLock"
> >>>>     ('postProcessRefresh() -> afterRefreshTreatment() -> if (!succeeded)
> >>>>     lockManager.releaseLock(monitoringLock);').
> >>>>
> >>>>   * 'VdsManager.refreshImpl' catches the network exception, marks
> >>>>     'releaseLock = true' and tries to release the already released lock.
> >>>>     The following warning is printed to the log:
> >>>>
> >>>>     WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
> >>>>     (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release
> >>>>     exclusive lock which does not exist, lock key:
> >>>>     'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
> >>>>
> >>>> - 08:30:51 - a successful getCapabilitiesAsync is sent.
> >>>>
> >>>> - 08:32:55 - the failing test starts (Setup Networks for setting ipv6).
> >>>>
> >>>>   * SetupNetworks takes the monitoring lock.
> >>>>
> >>>> - 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from
> >>>>   4 minutes ago from its queue and prints a VDSNetworkException: Vds
> >>>>   timeout occured.
> >>>>
> >>>>   * When the first request is removed from the queue
> >>>>     ('ResponseTracker.remove()'), 'Callback.onFailure' is invoked (for the
> >>>>     second time) -> the monitoring lock is released (the lock taken by
> >>>>     SetupNetworks!).
> >>>>
> >>>>   * The other requests removed from the queue also try to release the
> >>>>     monitoring lock, but there is nothing to release. The following
> >>>>     warning is printed:
> >>>>
> >>>>     WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
> >>>>     (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release
> >>>>     exclusive lock which does not exist, lock key:
> >>>>     'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
> >>>>
> >>>> - 08:33:00 - SetupNetworks fails on timeout, ~4 seconds after it started.
> >>>>   Why? I'm not 100% sure, but I guess the root cause is the late processing
> >>>>   of 'getCapabilitiesAsync', which causes the loss of the monitoring lock,
> >>>>   together with the late and multiple processing of the failure.
> >>>>
> >>>> Ravi, the 'getCapabilitiesAsync' failure is handled twice and the release
> >>>> of the lock is attempted three times. Please share your opinion regarding
> >>>> how it should be fixed.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Alona.
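One possible direction, shown only as a minimal sketch (this is not the actual
engine code or its lock API): make the failure handling idempotent, so that
however many paths end up invoking 'onFailure' (the immediate catch in the
command, refreshImpl, the ResponseTracker timeout cleanup), the monitoring lock
is released at most once and later invocations become no-ops.

    import java.util.concurrent.atomic.AtomicBoolean;

    final class OnceOnlyFailureHandler {
        private final AtomicBoolean released = new AtomicBoolean(false);
        // e.g. () -> lockManager.releaseLock(monitoringLock)
        private final Runnable releaseMonitoringLock;

        OnceOnlyFailureHandler(Runnable releaseMonitoringLock) {
            this.releaseMonitoringLock = releaseMonitoringLock;
        }

        void onFailure(Throwable t) {
            // compareAndSet guarantees only the first caller performs the
            // release; duplicate failure notifications do nothing (the
            // throwable would just be logged here).
            if (released.compareAndSet(false, true)) {
                releaseMonitoringLock.run();
            }
        }
    }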
> >>>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <[email protected]> wrote:
> >>>>
> >>>>> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <[email protected]> wrote:
> >>>>>
> >>>>>> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <[email protected]> wrote:
> >>>>>>
> >>>>>>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
> >>>>>>> Is it still failing?
> >>>>>>>
> >>>>>>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> On 7 April 2018 at 00:30, Dan Kenigsberg <[email protected]> wrote:
> >>>>>>>> > No, I am afraid that we have not managed to understand why setting
> >>>>>>>> > an ipv6 address took the host off the grid. We shall continue
> >>>>>>>> > researching this next week.
> >>>>>>>> >
> >>>>>>>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but
> >>>>>>>> > could it possibly be related (I really doubt that)?
> >>>>>>>> >
> >>>>>>
> >>>>>> Sorry, but I do not see how this problem is related to VDSM.
> >>>>>> There is nothing that indicates that there is a VDSM problem.
> >>>>>>
> >>>>>> Has the RPC connection between Engine and VDSM failed?
> >>>>>>
> >>>>> Further up the thread, Piotr noticed that (at least on one failure of
> >>>>> this test) the Vdsm host lost connectivity to its storage, and the Vdsm
> >>>>> process was restarted. However, this does not seem to happen in all cases
> >>>>> where this test fails.
> >>>
> >>> --
> >>> Gal Ben Haim
> >>> RHV DevOps
_______________________________________________
Devel mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/devel
