Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

Alona Kaplan Wed, 11 Apr 2018 02:46:16 -0700

On Tue, Apr 10, 2018 at 6:52 PM, Gal Ben Haim <[email protected]> wrote:


> I'm seeing the same error in [1], during 006_migrations.migrate_vm.
>
> [1] http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1650/
>

Seems like another bug. The migration failed since for some reason the vm
is already defined on the destination host.

2018-04-10 11:08:08,685-0400 ERROR (jsonrpc/0) [api] FINISH create
error=Virtual machine already exists (api:129)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in
method
ret = func(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 191, in create
raise exception.VMExists()
VMExists: Virtual machine already exists




>
>
> On Tue, Apr 10, 2018 at 4:14 PM, Alona Kaplan <[email protected]> wrote:
>
>> Hi all,
>>
>> Looking at the log it seems that the new GetCapabilitiesAsync is
>> responsible for the mess.
>>
>> -
>> * 08:29:47 - engine loses connectivity to host 
>> 'lago-basic-suite-4-2-host-0'.*
>>
>>
>>
>> *- Every 3 seconds a getCapabalititiesAsync request is sent to the host 
>> (unsuccessfully).*
>>
>>      * before each "getCapabilitiesAsync" the monitoring lock is taken 
>> (VdsManager,refreshImpl)
>>
>>      * "getCapabilitiesAsync" immediately fails and throws 
>> 'VDSNetworkException: java.net.ConnectException: Connection refused'. The 
>> exception is caught by 
>> 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 
>> 'onFailure' of the callback and re-throws the exception.
>>
>>          catch (Throwable t) {
>>             getParameters().getCallback().onFailure(t);
>>             throw t;
>>          }
>>
>>     * The 'onFailure' of the callback releases the "monitoringLock" 
>> ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded) 
>> lockManager.releaseLock(monitoringLock);')
>>
>>     * 'VdsManager,refreshImpl' catches the network exception, marks 
>> 'releaseLock = true' and *tries to release the already released lock*.
>>
>>       The following warning is printed to the log -
>>
>>       WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] 
>> (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release 
>> exclusive lock which does not exist, lock key: 
>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>
>>
>>
>>
>> *- 08:30:51 a successful getCapabilitiesAsync is sent.*
>>
>>
>> *- 08:32:55 - The failing test starts (Setup Networks for setting ipv6).    *
>>
>>     * SetupNetworks takes the monitoring lock.
>>
>> *- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 
>> 4 minutes ago from its queue and prints a VDSNetworkException: Vds timeout 
>> occured.*
>>
>>       * When the first request is removed from the queue 
>> ('ResponseTracker.remove()'), the
>> *'Callback.onFailure' is invoked (for the second time) -> monitoring lock is 
>> released (the lock taken by the SetupNetworks!).*
>>
>>       * *The other requests removed from the queue also try to release the 
>> monitoring lock*, but there is nothing to release.
>>
>>       * The following warning log is printed -
>>         WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] 
>> (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release 
>> exclusive lock which does not exist, lock key: 
>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>
>> - *08:33:00 - SetupNetwork fails on Timeout ~4 seconds after is started*. 
>> Why? I'm not 100% sure but I guess the late processing of the 
>> 'getCapabilitiesAsync' that causes losing of the monitoring lock and the 
>> late + mupltiple processing of failure is root cause.
>>
>>
>> Ravi, 'getCapabilitiesAsync' failure is treated twice and the lock is trying 
>> to be released three times. Please share your opinion regarding how it 
>> should be fixed.
>>
>>
>> Thanks,
>>
>> Alona.
>>
>>
>>
>>
>>
>>
>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <[email protected]> wrote:
>>
>>> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <[email protected]> wrote:
>>>
>>>>
>>>>
>>>> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <[email protected]> wrote:
>>>>
>>>>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
>>>>> Is it still failing?
>>>>>
>>>>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> On 7 April 2018 at 00:30, Dan Kenigsberg <[email protected]> wrote:
>>>>>> > No, I am afraid that we have not managed to understand why setting
>>>>>> and
>>>>>> > ipv6 address too the host off the grid. We shall continue
>>>>>> researching
>>>>>> > this next week.
>>>>>> >
>>>>>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old,
>>>>>> but
>>>>>> > could it possibly be related (I really doubt that)?
>>>>>> >
>>>>>>
>>>>>
>>>> Sorry, but I do not see how this problem is related to VDSM.
>>>> There is nothing that indicates that there is a VDSM problem.
>>>>
>>>> Has the RPC connection between Engine and VDSM failed?
>>>>
>>>>
>>> Further up the thread, Piotr noticed that (at least on one failure of
>>> this test) that the Vdsm host lost connectivity to its storage, and Vdsm
>>> process was restarted. However, this does not seems to happen in all cases
>>> where this test fails.
>>>
>>> _______________________________________________
>>> Devel mailing list
>>> [email protected]
>>> http://lists.ovirt.org/mailman/listinfo/devel
>>>
>>
>>
>> _______________________________________________
>> Devel mailing list
>> [email protected]
>> http://lists.ovirt.org/mailman/listinfo/devel
>>
>
>
>
> --
> *GAL bEN HAIM*
> RHV DEVOPS
>

_______________________________________________
Devel mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/devel

Re: [ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

Reply via email to