> On 16 Sep 2019, at 10:30, Milan Zamazal <[email protected]> wrote:
>
> Dusan Fodor <[email protected]> writes:
>
>> After even more investigation, root of issue seems to lie in vdsm receiving
>> SIGTERM in the only host that is in state up [1]:
>> *[vds] Received signal 15, shutting down (vdsmd:70)*
>
> I see, thank you for looking into it and finding the signal.  Can you
> see in the logs what could cause this?  Are Engine fencing attempts
> issued before or after this signal?  If it is not caused by Engine
> fencing, is there anything in the system logs explaining that SIGTERM?
unrelated

>
> Let's take the upcoming OST gating as an opportunity to fix that host
> status flipping problem.  It must be fixed before OST gating is enabled.

it seems rather infra-related to the initOnVdsUp() processing. Best for
now would be to wait a little and try again to check the Host status
once it’s Up for the first time.

Thanks,
michal

>
>> while the other host is still in status Installing (so it cannot be used
>> for fencing- hence the fence action failure).
>> The vdsm then goes back up in few moments, but engine, expecting the host
>> is up all the time, meanwhile fails doing an operation that requires host
>> to be up.
>>
>> [1]
>> https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host-0/_var_log/vdsm/vdsm.log
>>
>> On Fri, Sep 13, 2019 at 5:18 PM Dusan Fodor <[email protected]> wrote:
>>
>>> For brave investigators, similar issue in later stage of the same test can
>>> be found here [1]. Same symptom of fence action fail, but this time it
>>> causes failure for adding storage itself:
>>> *2019-09-12 09:53:32,571-04 ERROR
>>> [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default
>>> task-1) [] Operation Failed: [Cannot attach Storage. There is no active
>>> Host in the Data Center.]*
>>>
>>> [1] https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15821
>>>
>>> On Fri, Sep 13, 2019 at 5:09 PM Dusan Fodor <[email protected]> wrote:
>>>
>>>> Hello all,
>>>> lately i witnessed multiple failures for add_master_storage_domain test,
>>>> which were not related to changes themselves, nor any infra issue. One
>>>> example can be found here [1].
>>>> After investigation with huge help of Milan, issue is that Host falls
>>>> from up state to whatever-but-not-up suddenly.
>>>>
>>>>
>>>>    1. add_storage_domain picks a random host that is in up state
>>>>    2. meantime engine starts fence action for it, so probably something
>>>>    gone bad with the host; the fence action fails with:
>>>>    *[org.ovirt.engine.core.bll.pm.FenceProxyLocator]
>>>>    (EE-ManagedThreadFactory-engineScheduled-Thread-38) [6692895f] Can not
>>>>    run fence action on host 'lago-basic-suite-master-host-0', no suitable
>>>>    proxy host was found.*
>>>>    3. test fails on not being able to attach the domain to non-up
>>>>    host:
>>>>    *[org.ovirt.engine.api.restapi.resource.AbstractBackendResource]
>>>>    (default task-1) [] Operation Failed: [Cannot add storage server
>>>>    connection when Host status is not up]*
>>>>
>>>> For better orientation in failed job's engine log [1], fence action for
>>>> host fails at
>>>> :46:12,842-04
>>>> engine learns it cannot connect storage to host at
>>>> :46:16,105-04
>>>>
>>>> The test itself add_master_storage_domain starts at ~ :46:13,753
>>>> (according to lago log).
>>>>
>>>> Could you please check this?
>>>> Thanks
>>>>
>>>> [1] https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829
>>>> [2]
>>>> https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-engine/_var_log/ovirt-engine/engine.log
>>>>
>>>>
_______________________________________________
Devel mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/WMOMWQECMKITJSSXQAYIRGXHRFGS4BMJ/
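For illustration only, the workaround suggested in this thread (wait a little and check the Host status again after it has been Up for the first time) could be sketched roughly as below. This is not code from OST or Engine. The helper name wait_for_stable_up, its parameters, the "up" status string and the idea of requiring several consecutive Up readings are all assumptions made for the sketch; get_status stands for whatever host-status query the test already uses.

```python
import time
from typing import Callable


def wait_for_stable_up(
    get_status: Callable[[], str],        # hypothetical: returns the host status as a string
    required_consecutive: int = 3,        # assumed: how many Up readings in a row count as stable
    interval: float = 1.0,                # seconds between polls
    timeout: float = 60.0,                # give up after this many seconds
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once get_status() reports "up" on required_consecutive
    polls in a row; return False if the timeout expires first."""
    deadline = clock() + timeout
    streak = 0
    while True:
        if get_status().lower() == "up":
            streak += 1
            if streak >= required_consecutive:
                return True
        else:
            # Host flipped away from Up (e.g. vdsm restarting); count again.
            streak = 0
        if clock() >= deadline:
            return False
        sleep(interval)
```

A test such as add_master_storage_domain could call a helper like this before picking a host, so that a host which only briefly reported Up is not used while vdsm is still restarting.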
