Milan Zamazal <[email protected]> writes:

> Michal Skrivanek <[email protected]> writes:
>
>>> On 16 Sep 2019, at 10:30, Milan Zamazal <[email protected]> wrote:
>>> 
>>> Dusan Fodor <[email protected]> writes:
>>> 
>>>> After even more investigation, the root of the issue seems to lie in vdsm
>>>> receiving SIGTERM on the only host that is in the Up state [1]:
>>>> *[vds] Received signal 15, shutting down (vdsmd:70)*
>>> 
>>> I see, thank you for looking into it and finding the signal.  Can you
>>> see in the logs what could cause this?  Are Engine fencing attempts
>>> issued before or after this signal?  If it is not caused by Engine
>>> fencing, is there anything in the system logs explaining that SIGTERM?
>>
>> unrelated
>>
>>> 
>>> Let's take the upcoming OST gating as an opportunity to fix that host
>>> status flipping problem.  It must be fixed before OST gating is enabled.
>>
>> it seems to be rather an infra issue related to the initOnVdsUp() processing.
>> The best option for now would be to wait a little and check the host status
>> again once it's Up for the first time.
>
> Is there any alternative to waiting, such as checking that a VDS Up
> event appeared twice?

Is anybody working on a fix for this failure?
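
For reference, a minimal sketch of the wait-and-recheck idea, assuming the
ovirtsdk4 Python API; the helper name, the timing constants and the connection
handling are only illustrative assumptions, not actual OST code:

    import time

    import ovirtsdk4.types as types

    def wait_until_host_stays_up(connection, host_name,
                                 recheck_delay=10, timeout=300):
        # Wait until Engine reports the host as Up and it is still Up after
        # recheck_delay seconds, to catch a host that flips back down shortly
        # after reaching Up for the first time (e.g. when vdsm is restarted).
        hosts_service = connection.system_service().hosts_service()
        deadline = time.time() + timeout
        while time.time() < deadline:
            hosts = hosts_service.list(search='name=%s' % host_name)
            if hosts and hosts[0].status == types.HostStatus.UP:
                # Don't trust the first Up; re-check after a while.
                time.sleep(recheck_delay)
                hosts = hosts_service.list(search='name=%s' % host_name)
                if hosts and hosts[0].status == types.HostStatus.UP:
                    return True
            else:
                time.sleep(recheck_delay)
        return False

This is essentially the "wait a little and check again" approach; an event-based
variant would watch for two VDS Up events instead of polling.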

>> Thanks,
>> michal
>>
>>> 
>>>> while the other host is still in the Installing status (so it cannot be used
>>>> as a fence proxy, hence the fence action failure).
>>>> Vdsm then comes back up after a few moments, but Engine, which expects the
>>>> host to be up the whole time, meanwhile fails an operation that requires the
>>>> host to be up.
>>>> 
>>>> [1]
>>>> https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host-0/_var_log/vdsm/vdsm.log
>>>> 
>>>> On Fri, Sep 13, 2019 at 5:18 PM Dusan Fodor <[email protected]> wrote:
>>>> 
>>>>> For brave investigators, a similar issue in a later stage of the same test
>>>>> can be found here [1]. It shows the same symptom of a failed fence action,
>>>>> but this time it causes the storage attachment itself to fail:
>>>>> *2019-09-12 09:53:32,571-04 ERROR
>>>>> [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default
>>>>> task-1) [] Operation Failed: [Cannot attach Storage. There is no active
>>>>> Host in the Data Center.]*
>>>>> 
>>>>> [1] https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15821
>>>>> 
>>>>> On Fri, Sep 13, 2019 at 5:09 PM Dusan Fodor <[email protected]> wrote:
>>>>> 
>>>>>> Hello all,
>>>>>> lately I have witnessed multiple failures of the add_master_storage_domain
>>>>>> test, which were related neither to the changes themselves nor to any infra
>>>>>> issue. One example can be found here [1].
>>>>>> After investigation, with huge help from Milan, the issue is that the host
>>>>>> suddenly drops from the Up state to some other, non-Up state.
>>>>>> 
>>>>>> 
>>>>>>   1. add_storage_domain picks a random host that is in the Up state
>>>>>>   (see the sketch below this list)
>>>>>>   2. meanwhile Engine starts a fence action for it, so something probably
>>>>>>   went wrong with the host; the fence action fails with:
>>>>>>   *[org.ovirt.engine.core.bll.pm.FenceProxyLocator]
>>>>>>   (EE-ManagedThreadFactory-engineScheduled-Thread-38) [6692895f] Can not run
>>>>>>   fence action on host 'lago-basic-suite-master-host-0', no suitable proxy
>>>>>>   host was found.*
>>>>>>   3. the test fails because it cannot attach the domain to the non-Up host:
>>>>>>   *[org.ovirt.engine.api.restapi.resource.AbstractBackendResource]
>>>>>>   (default task-1) [] Operation Failed: [Cannot add storage server connection
>>>>>>   when Host status is not up]*
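>>>>>> 
>>>>>> (A minimal sketch of the host selection in step 1, assuming the ovirtsdk4
>>>>>> Python API; the function name is only illustrative, not the actual OST code:
>>>>>> 
>>>>>>     import random
>>>>>> 
>>>>>>     def pick_up_host(connection):
>>>>>>         # Ask Engine for the hosts it currently reports as Up and pick one
>>>>>>         # at random; the status is only a snapshot and can change again
>>>>>>         # before the storage domain is attached, which is what happens here.
>>>>>>         hosts_service = connection.system_service().hosts_service()
>>>>>>         up_hosts = hosts_service.list(search='status=up')
>>>>>>         if not up_hosts:
>>>>>>             raise RuntimeError('no host in the Up state')
>>>>>>         return random.choice(up_hosts)
>>>>>> 
>>>>>> The Up status returned here is only Engine's view at query time, so the host
>>>>>> can already be on its way down by the time the test uses it.)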
>>>>>> 
>>>>>> For better orientation in the failed job's engine log [2]: the fence action
>>>>>> for the host fails at
>>>>>> :46:12,842-04
>>>>>> and Engine learns it cannot connect storage to the host at
>>>>>> :46:16,105-04
>>>>>> 
>>>>>> The test itself, add_master_storage_domain, starts at ~ :46:13,753
>>>>>> (according to the lago log).
>>>>>> 
>>>>>> Could you please check this?
>>>>>> Thanks
>>>>>> 
>>>>>> [1] https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829
>>>>>> [2]
>>>>>> https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-engine/_var_log/ovirt-engine/engine.log
>>>>>> 
>>>>>> 
