> On 16 Sep 2019, at 10:30, Milan Zamazal <[email protected]> wrote:
> 
> Dusan Fodor <[email protected]> writes:
> 
>> After even more investigation, the root of the issue seems to lie in vdsm
>> receiving SIGTERM on the only host that is in state up [1]:
>> *[vds] Received signal 15, shutting down (vdsmd:70)*
> 
> I see, thank you for looking into it and finding the signal.  Can you
> see in the logs what could cause this?  Are Engine fencing attempts
> issued before or after this signal?  If it is not caused by Engine
> fencing, is there anything in the system logs explaining that SIGTERM?

unrelated

> 
> Let's take the upcoming OST gating as an opportunity to fix that host
> status flipping problem.  It must be fixed before OST gating is enabled.

It seems rather infra-related, tied to the initOnVdsUp() processing. The best
approach for now would be to wait a little and check the Host status again once
it’s Up for the first time.
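
As a rough illustration of that wait-and-recheck idea, here is a minimal
sketch. It is not taken from OST or the oVirt SDK: wait_for_stable_up and
get_status are hypothetical names, and get_status stands for whatever call
the test already uses to read the host status as a lowercase string. The
helper only reports success after the host has been seen 'up' on several
consecutive polls, so a single early Up followed by a flip is not trusted.

```python
import time


def wait_for_stable_up(get_status, required_consecutive=3, interval=1.0,
                       timeout=60.0, sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns 'up' several times in a row.

    get_status is a hypothetical callable supplied by the caller; it is
    assumed to return the current host status as a lowercase string.
    Returns True once 'up' was seen required_consecutive times in a row,
    or False if the timeout elapses first.
    """
    deadline = clock() + timeout
    consecutive = 0
    while True:
        if get_status() == 'up':
            consecutive += 1
            if consecutive >= required_consecutive:
                return True
        else:
            # The status flipped away from 'up'; start counting again.
            consecutive = 0
        if clock() >= deadline:
            return False
        sleep(interval)
```

A test such as add_master_storage_domain could call a helper like this
before picking a host. The poll count, interval and timeout are placeholder
values and would need tuning against real runs.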

Thanks,
michal

> 
>> while the other host is still in status Installing (so it cannot be used
>> for fencing, hence the fence action failure).
>> Vdsm then comes back up in a few moments, but the engine, expecting the host
>> to be up the whole time, meanwhile fails an operation that requires the
>> host to be up.
>> 
>> [1]
>> https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host-0/_var_log/vdsm/vdsm.log
>> 
>> On Fri, Sep 13, 2019 at 5:18 PM Dusan Fodor <[email protected]> wrote:
>> 
>>> For brave investigators, a similar issue in a later stage of the same test
>>> can be found here [1]. Same symptom of fence action failure, but this time
>>> it causes a failure when adding the storage itself:
>>> *2019-09-12 09:53:32,571-04 ERROR
>>> [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default
>>> task-1) [] Operation Failed: [Cannot attach Storage. There is no active
>>> Host in the Data Center.]*
>>> 
>>> [1] https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15821
>>> 
>>> On Fri, Sep 13, 2019 at 5:09 PM Dusan Fodor <[email protected]> wrote:
>>> 
>>>> Hello all,
>>>> lately I witnessed multiple failures of the add_master_storage_domain test
>>>> that were related neither to the changes themselves nor to any infra issue.
>>>> One example can be found here [1].
>>>> After investigation, with huge help from Milan, the issue is that the Host
>>>> suddenly falls from up state to whatever-but-not-up.
>>>> 
>>>> 
>>>>   1. add_storage_domain picks a random host that is in up state
>>>>   2. meanwhile the engine starts a fence action for it, so probably
>>>>   something went wrong with the host; the fence action fails with:
>>>>   *[org.ovirt.engine.core.bll.pm.FenceProxyLocator]
>>>>   (EE-ManagedThreadFactory-engineScheduled-Thread-38) [6692895f] Can not run
>>>>   fence action on host 'lago-basic-suite-master-host-0', no suitable proxy
>>>>   host was found.*
>>>>   3. the test fails because it cannot attach the domain to a non-up host:
>>>>   *[org.ovirt.engine.api.restapi.resource.AbstractBackendResource]
>>>>   (default task-1) [] Operation Failed: [Cannot add storage server connection
>>>>   when Host status is not up]*
>>>> 
>>>> For better orientation in the failed job's engine log [2], the fence action
>>>> for the host fails at
>>>> :46:12,842-04
>>>> and the engine learns it cannot connect storage to the host at
>>>> :46:16,105-04
>>>> 
>>>> The add_master_storage_domain test itself starts at ~ :46:13,753
>>>> (according to the lago log).
>>>> 
>>>> Could you please check this?
>>>> Thanks
>>>> 
>>>> [1] https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829
>>>> [2]
>>>> https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-engine/_var_log/ovirt-engine/engine.log
>>>> 
>>>> 
>> _______________________________________________
>> Devel mailing list -- [email protected]
>> To unsubscribe send an email to [email protected]
>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>> oVirt Code of Conduct: 
>> https://www.ovirt.org/community/about/community-guidelines/
>> List Archives:
>> https://lists.ovirt.org/archives/list/[email protected]/message/MMH7DGCH24G7VVBGHXEFT3AKKJP726PL/
