> On 12 Apr 2018, at 17:15, Dafna Ron <[email protected]> wrote:
>
> hi,
>
> we are failing randomly on test 006_migrations.migrate_vm with what seems to
> be the same issue.
>
> the vm seems to be migrated successfully, but the engine thinks that it failed
> and re-calls migration, getting a response that the vm already exists.
>
> I don't think this is an issue with the test but rather a regression, so I
> opened a bug:

I do not think so. I've heard someone removed a test in between migrating A->B
and B->A? If that's the case, that is the real issue. You can't migrate back to
A without waiting for A to be cleared out properly.
https://gerrit.ovirt.org/#/c/90166/ should fix it
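For illustration only, a minimal test-side sketch of the "wait for A to be
cleared out" guard described above. This is not the actual OST/Lago code; the
vmGoneFromSource predicate is a hypothetical stand-in for whatever API the
suite uses to ask the source host whether it still reports the VM.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.function.BooleanSupplier;

    public final class MigrateBackGuard {
        // Poll until the source host no longer reports the VM, or give up
        // after the timeout; only then is it safe to migrate back.
        public static void waitUntilVmGone(BooleanSupplier vmGoneFromSource, Duration timeout)
                throws InterruptedException {
            Instant deadline = Instant.now().plus(timeout);
            while (!vmGoneFromSource.getAsBoolean()) {
                if (Instant.now().isAfter(deadline)) {
                    throw new IllegalStateException("source host still reports the VM after " + timeout);
                }
                Thread.sleep(1_000); // poll once per second
            }
        }
    }

Usage would be along the lines of
MigrateBackGuard.waitUntilVmGone(() -> !sourceHostReportsVm(vmId), Duration.ofMinutes(2));
where sourceHostReportsVm is whatever check the suite already has for the
source host's VM list.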
> https://bugzilla.redhat.com/show_bug.cgi?id=1566594
>
> Thanks,
> Dafna
>
>
> On Wed, Apr 11, 2018 at 1:52 PM, Milan Zamazal <[email protected]> wrote:
> Arik Hadas <[email protected]> writes:
>
> > On Wed, Apr 11, 2018 at 12:45 PM, Alona Kaplan <[email protected]> wrote:
> >
> >> On Tue, Apr 10, 2018 at 6:52 PM, Gal Ben Haim <[email protected]> wrote:
> >>
> >>> I'm seeing the same error in [1], during 006_migrations.migrate_vm.
> >>>
> >>> [1] http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1650/
> >>>
> >>
> >> Seems like another bug. The migration failed since, for some reason, the vm
> >> is already defined on the destination host.
> >>
> >> 2018-04-10 11:08:08,685-0400 ERROR (jsonrpc/0) [api] FINISH create
> >> error=Virtual machine already exists (api:129)
> >> Traceback (most recent call last):
> >>   File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
> >>     ret = func(*args, **kwargs)
> >>   File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 191, in create
> >>     raise exception.VMExists()
> >> VMExists: Virtual machine already exists
> >>
> > Milan, Francesco, could it be that because of [1], which appears on the
> > destination host right after shutting down the VM, it remained defined on
> > that host?
>
> I can't see any destroy call in the logs after the successful preceding
> migration from the given host. That would explain the "VMExists" error.
>
> > [1] 2018-04-10 11:01:40,005-0400 ERROR (libvirt/events) [vds] Error running
> > VM callback (clientIF:683)
> >
> > Traceback (most recent call last):
> >   File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 646, in dispatchLibvirtEvents
> >     v.onLibvirtLifecycleEvent(event, detail, None)
> > AttributeError: 'NoneType' object has no attribute 'onLibvirtLifecycleEvent'
>
> That means that a life cycle event on an unknown VM has arrived, in this
> case apparently a destroy event, following the destroy call after the
> failed incoming migration. The reported AttributeError is a minor bug,
> already fixed in master, so it's most likely unrelated to the discussed
> problem.
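To make the mechanism Milan describes concrete, here is a rough,
language-agnostic sketch only (vdsm itself is Python and these names are
hypothetical, not its real API): the destination keeps a table of defined VMs,
so if the destroy after the earlier migration never happens, the stale entry
is still there and the next incoming create fails with "already exists".

    import java.util.concurrent.ConcurrentHashMap;

    final class VmRegistrySketch {
        static final class VMExistsException extends RuntimeException {}

        private final ConcurrentHashMap<String, Object> vms = new ConcurrentHashMap<>();

        void create(String vmId) {
            // An incoming migration defines the VM; refuse if it is still registered.
            if (vms.putIfAbsent(vmId, new Object()) != null) {
                throw new VMExistsException();
            }
        }

        void destroy(String vmId) {
            // Without this call after migrating the VM away, the entry stays
            // behind and the next create() for the same id fails.
            vms.remove(vmId);
        }
    }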
> >>> On Tue, Apr 10, 2018 at 4:14 PM, Alona Kaplan <[email protected]> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> Looking at the log, it seems that the new GetCapabilitiesAsync is
> >>>> responsible for the mess.
> >>>>
> >>>> - 08:29:47 - the engine loses connectivity to host
> >>>>   'lago-basic-suite-4-2-host-0'.
> >>>>
> >>>> - Every 3 seconds a getCapabilitiesAsync request is sent to the host
> >>>>   (unsuccessfully).
> >>>>
> >>>>   * Before each "getCapabilitiesAsync" the monitoring lock is taken
> >>>>     (VdsManager.refreshImpl).
> >>>>
> >>>>   * "getCapabilitiesAsync" immediately fails and throws
> >>>>     'VDSNetworkException: java.net.ConnectException: Connection refused'.
> >>>>     The exception is caught by
> >>>>     'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand', which calls
> >>>>     'onFailure' of the callback and re-throws the exception:
> >>>>
> >>>>         catch (Throwable t) {
> >>>>             getParameters().getCallback().onFailure(t);
> >>>>             throw t;
> >>>>         }
> >>>>
> >>>>   * The 'onFailure' of the callback releases the "monitoringLock"
> >>>>     ('postProcessRefresh() -> afterRefreshTreatment() -> if (!succeeded)
> >>>>     lockManager.releaseLock(monitoringLock);').
> >>>>
> >>>>   * 'VdsManager.refreshImpl' catches the network exception, marks
> >>>>     'releaseLock = true' and tries to release the already released lock.
> >>>>     The following warning is printed to the log:
> >>>>
> >>>>     WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
> >>>>     (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release
> >>>>     exclusive lock which does not exist, lock key:
> >>>>     'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
> >>>>
> >>>> - 08:30:51 - a successful getCapabilitiesAsync is sent.
> >>>>
> >>>> - 08:32:55 - the failing test starts (Setup Networks for setting ipv6).
> >>>>
> >>>>   * SetupNetworks takes the monitoring lock.
> >>>>
> >>>> - 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from
> >>>>   4 minutes ago from its queue and prints a VDSNetworkException: Vds
> >>>>   timeout occured.
> >>>>
> >>>>   * When the first request is removed from the queue
> >>>>     ('ResponseTracker.remove()'), 'Callback.onFailure' is invoked (for the
> >>>>     second time) -> the monitoring lock is released (the lock taken by
> >>>>     SetupNetworks!).
> >>>>
> >>>>   * The other requests removed from the queue also try to release the
> >>>>     monitoring lock, but there is nothing to release. The following
> >>>>     warning is printed:
> >>>>
> >>>>     WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
> >>>>     (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release
> >>>>     exclusive lock which does not exist, lock key:
> >>>>     'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
> >>>>
> >>>> - 08:33:00 - SetupNetworks fails on timeout, ~4 seconds after it started.
> >>>>   Why? I'm not 100% sure, but I guess the root cause is the late processing
> >>>>   of 'getCapabilitiesAsync', which causes the loss of the monitoring lock,
> >>>>   together with the late and multiple processing of the failure.
> >>>>
> >>>> Ravi, the 'getCapabilitiesAsync' failure is handled twice and the release
> >>>> of the lock is attempted three times. Please share your opinion regarding
> >>>> how it should be fixed.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Alona.
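One possible direction, shown only as a minimal sketch (this is not the actual
engine code or its lock API): make the failure handling idempotent, so that
however many paths end up invoking 'onFailure' (the immediate catch in the
command, refreshImpl, the ResponseTracker timeout cleanup), the monitoring lock
is released at most once and later invocations become no-ops.

    import java.util.concurrent.atomic.AtomicBoolean;

    final class OnceOnlyFailureHandler {
        private final AtomicBoolean released = new AtomicBoolean(false);
        // e.g. () -> lockManager.releaseLock(monitoringLock)
        private final Runnable releaseMonitoringLock;

        OnceOnlyFailureHandler(Runnable releaseMonitoringLock) {
            this.releaseMonitoringLock = releaseMonitoringLock;
        }

        void onFailure(Throwable t) {
            // compareAndSet guarantees only the first caller performs the
            // release; duplicate failure notifications do nothing (the
            // throwable would just be logged here).
            if (released.compareAndSet(false, true)) {
                releaseMonitoringLock.run();
            }
        }
    }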
> >>>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <[email protected]> wrote:
> >>>>
> >>>>> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <[email protected]> wrote:
> >>>>>
> >>>>>> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <[email protected]> wrote:
> >>>>>>
> >>>>>>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
> >>>>>>> Is it still failing?
> >>>>>>>
> >>>>>>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> On 7 April 2018 at 00:30, Dan Kenigsberg <[email protected]> wrote:
> >>>>>>>> > No, I am afraid that we have not managed to understand why setting
> >>>>>>>> > an ipv6 address took the host off the grid. We shall continue
> >>>>>>>> > researching this next week.
> >>>>>>>> >
> >>>>>>>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but
> >>>>>>>> > could it possibly be related (I really doubt that)?
> >>>>>>>> >
> >>>>>>
> >>>>>> Sorry, but I do not see how this problem is related to VDSM.
> >>>>>> There is nothing that indicates that there is a VDSM problem.
> >>>>>>
> >>>>>> Has the RPC connection between Engine and VDSM failed?
> >>>>>>
> >>>>> Further up the thread, Piotr noticed that (at least on one failure of
> >>>>> this test) the Vdsm host lost connectivity to its storage, and the Vdsm
> >>>>> process was restarted. However, this does not seem to happen in all cases
> >>>>> where this test fails.
> >>>
> >>> --
> >>> Gal Ben Haim
> >>> RHV DevOps
_______________________________________________
Devel mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/devel
