Michal Skrivanek <[email protected]> writes:

>> On 16 Sep 2019, at 10:30, Milan Zamazal <[email protected]> wrote:
>> 
>> Dusan Fodor <[email protected]> writes:
>> 
>>> After even more investigation, the root of the issue seems to be vdsm receiving
>>> SIGTERM on the only host that is in the Up state [1]:
>>> *[vds] Received signal 15, shutting down (vdsmd:70)*
>> 
>> I see, thank you for looking into it and finding the signal.  Can you
>> see in the logs what could cause this?  Are Engine fencing attempts
>> issued before or after this signal?  If it is not caused by Engine
>> fencing, is there anything in the system logs explaining that SIGTERM?
>
> unrelated
>
>> 
>> Let's take the upcoming OST gating as an opportunity to fix that host
>> status flipping problem.  It must be fixed before OST gating is enabled.
>
> it seems rather infra-related to the initOnVdsUp() processing. Best
> for now would be to wait a little and try again to check the Host
> status once it’s Up for the first time.

Is there any alternative to waiting?  For example, checking that a VDS Up
event or similar has appeared twice?
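
To make the question concrete, here is roughly what a "seen Up more than
once" check could look like on the test side.  This is only a sketch, not
existing OST code: get_status is a hypothetical callable that would wrap
however the suite reads the host status, and the 'up' string and the
default thresholds are assumptions.  It still polls, so it shortens the
wait rather than removing it.

```python
import time


def wait_for_stable_up(get_status, required=3, interval=1.0, timeout=60.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Return True once get_status() has returned 'up' on `required`
    consecutive polls; return False if `timeout` seconds elapse first.

    get_status is a hypothetical callable supplied by the test; the
    status strings it returns are an assumption of this sketch.
    """
    deadline = clock() + timeout
    consecutive = 0
    while True:
        if get_status() == 'up':
            consecutive += 1
            if consecutive >= required:
                return True
        else:
            # Any non-up reading (such as the flip seen after the
            # SIGTERM) resets the streak.
            consecutive = 0
        if clock() >= deadline:
            return False
        sleep(interval)
```

A test would call this right before picking a host for
add_master_storage_domain and fail early, with a clear message, if it
returns False.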

> Thanks,
> michal
>
>> 
>>> while the other host is still in status Installing (so it cannot be used
>>> for fencing, hence the fence action failure).
>>> vdsm then comes back up within a few moments, but the engine, which expects
>>> the host to have been up the whole time, meanwhile fails an operation that
>>> requires the host to be up.
>>> 
>>> [1]
>>> https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host-0/_var_log/vdsm/vdsm.log
>>> 
>>> On Fri, Sep 13, 2019 at 5:18 PM Dusan Fodor <[email protected]> wrote:
>>> 
>>>> For brave investigators, a similar issue in a later stage of the same test can
>>>> be found here [1]. It shows the same symptom of a failed fence action, but this
>>>> time it causes adding the storage itself to fail:
>>>> *2019-09-12 09:53:32,571-04 ERROR
>>>> [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default
>>>> task-1) [] Operation Failed: [Cannot attach Storage. There is no active
>>>> Host in the Data Center.]*
>>>> 
>>>> [1] https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15821
>>>> 
>>>> On Fri, Sep 13, 2019 at 5:09 PM Dusan Fodor <[email protected]> wrote:
>>>> 
>>>>> Hello all,
>>>>> lately I have witnessed multiple failures of the add_master_storage_domain test
>>>>> which were related neither to the changes themselves nor to any infra issue. One
>>>>> example can be found here [1].
>>>>> After investigation, with huge help from Milan, the issue is that the host
>>>>> suddenly falls from the Up state to some other, non-Up state.
>>>>> 
>>>>> 
>>>>>   1. add_storage_domain picks a random host that is in the Up state
>>>>>   2. meanwhile the engine starts a fence action for it, so probably something
>>>>>   has gone wrong with the host; the fence action fails with:
>>>>> *[org.ovirt.engine.core.bll.pm.FenceProxyLocator]
>>>>>   (EE-ManagedThreadFactory-engineScheduled-Thread-38) [6692895f] Can not run
>>>>>   fence action on host 'lago-basic-suite-master-host-0', no suitable proxy
>>>>>   host was found.*
>>>>>   3. the test fails because it cannot attach the domain to a non-Up host:
>>>>> *[org.ovirt.engine.api.restapi.resource.AbstractBackendResource]
>>>>>   (default task-1) [] Operation Failed: [Cannot add storage server connection
>>>>>   when Host status is not up]*
>>>>> 
>>>>> For better orientation in the failed job's engine log [2], the fence action
>>>>> for the host fails at
>>>>> :46:12,842-04
>>>>> and the engine learns it cannot connect storage to the host at
>>>>> :46:16,105-04
>>>>> 
>>>>> The add_master_storage_domain test itself starts at ~ :46:13,753
>>>>> (according to the lago log).
>>>>> 
>>>>> Could you please check this?
>>>>> Thanks
>>>>> 
>>>>> [1] https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829
>>>>> [2]
>>>>> https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-engine/_var_log/ovirt-engine/engine.log
>>>>> 
>>>>> 
>>> _______________________________________________
>>> Devel mailing list -- [email protected]
>>> To unsubscribe send an email to [email protected]
>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>> oVirt Code of Conduct: 
>>> https://www.ovirt.org/community/about/community-guidelines/
>>> List Archives:
>>> https://lists.ovirt.org/archives/list/[email protected]/message/MMH7DGCH24G7VVBGHXEFT3AKKJP726PL/
