Michal Skrivanek <[email protected]> writes:

>> On 16 Sep 2019, at 10:30, Milan Zamazal <[email protected]> wrote:
>>
>> Dusan Fodor <[email protected]> writes:
>>
>>> After even more investigation, root of issue seems to lie in vdsm receiving
>>> SIGTERM in the only host that is in state up [1]:
>>> *[vds] Received signal 15, shutting down (vdsmd:70)*
>>
>> I see, thank you for looking into it and finding the signal.  Can you
>> see in the logs what could cause this?  Are Engine fencing attempts
>> issued before or after this signal?  If it is not caused by Engine
>> fencing, is there anything in the system logs explaining that SIGTERM?
>
> unrelated
>
>>
>> Let's take the upcoming OST gating as an opportunity to fix that host
>> status flipping problem.  It must be fixed before OST gating is enabled.
>
> it seems rather infra-related to the initOnVdsUp() processing. Best
> for now would be to wait a little and try again to check the Host
> status once it’s Up for the first time.

Is there any alternative to waiting?  Such as checking that some VDS Up
event or so appeared twice?

> Thanks,
> michal
>
>>
>>> while the other host is still in status Installing (so it cannot be used
>>> for fencing- hence the fence action failure).
>>> The vdsm then goes back up in few moments, but engine, expecting the host
>>> is up all the time, meanwhile fails doing an operation that requires host
>>> to be up.
>>>
>>> [1]
>>> https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host-0/_var_log/vdsm/vdsm.log
>>>
>>> On Fri, Sep 13, 2019 at 5:18 PM Dusan Fodor <[email protected]> wrote:
>>>
>>>> For brave investigators, similar issue in later stage of the same test can
>>>> be found here [1]. Same symptom of fence action fail, but this time it
>>>> causes failure for adding storage itself:
>>>> *2019-09-12 09:53:32,571-04 ERROR
>>>> [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default
>>>> task-1) [] Operation Failed: [Cannot attach Storage. There is no active
>>>> Host in the Data Center.]*
>>>>
>>>> [1] https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15821
>>>>
>>>> On Fri, Sep 13, 2019 at 5:09 PM Dusan Fodor <[email protected]> wrote:
>>>>
>>>>> Hello all,
>>>>> lately i witnessed multiple failures for add_master_storage_domain test,
>>>>> which were not related to changes themselves, nor any infra issue. One
>>>>> example can be found here [1].
>>>>> After investigation with huge help of Milan, issue is that Host falls
>>>>> from up state to whatever-but-not-up suddenly.
>>>>>
>>>>>
>>>>>    1. add_storage_domain picks a random host that is in up state
>>>>>    2. meantime engine starts fence action for it, so probably something
>>>>>    gone bad with the host; the fence action fails with:
>>>>>    *[org.ovirt.engine.core.bll.pm.FenceProxyLocator]
>>>>>    (EE-ManagedThreadFactory-engineScheduled-Thread-38) [6692895f] Can not
>>>>>    run fence action on host 'lago-basic-suite-master-host-0', no suitable
>>>>>    proxy host was found.*
>>>>>    3. test fails on not being able to attach the domain to non-up
>>>>>    host:
>>>>>    *[org.ovirt.engine.api.restapi.resource.AbstractBackendResource]
>>>>>    (default task-1) [] Operation Failed: [Cannot add storage server
>>>>>    connection when Host status is not up]*
>>>>>
>>>>> For better orientation in failed job's engine log [1], fence action for
>>>>> host fails at
>>>>> :46:12,842-04
>>>>> engine learns it cannot connect storage to host at
>>>>> :46:16,105-04
>>>>>
>>>>> The test itself add_master_storage_domain starts at ~ :46:13,753
>>>>> (according to lago log).
>>>>>
>>>>> Could you please check this?
>>>>> Thanks
>>>>>
>>>>> [1] https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829
>>>>> [2]
>>>>> https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-engine/_var_log/ovirt-engine/engine.log
>>>>>
>>>>>
>>> _______________________________________________
>>> Devel mailing list -- [email protected]
>>> To unsubscribe send an email to [email protected]
>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>> oVirt Code of Conduct:
>>> https://www.ovirt.org/community/about/community-guidelines/
>>> List Archives:
>>> https://lists.ovirt.org/archives/list/[email protected]/message/MMH7DGCH24G7VVBGHXEFT3AKKJP726PL/
>> List Archives:
>> https://lists.ovirt.org/archives/list/[email protected]/message/KQY5JULWUDTJPQNQ4L6UDR4JSDIZS6IO/
List Archives:
https://lists.ovirt.org/archives/list/[email protected]/message/RPL5FZRWHOXV7ZPRUVR2662A4AAFD3D3/
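
For illustration only, the two options raised in this thread (waiting and
re-checking the host status, or requiring that Up be seen more than once)
could be combined in a test helper roughly like the one below. This is a
sketch under assumptions, not code from OST, Lago, or the oVirt SDK:
`wait_for_stable_up`, `get_status`, and all parameter names are hypothetical,
and the caller is assumed to supply a function that returns the current host
status as a lowercase string.

```python
import time


def wait_for_stable_up(get_status, required_consecutive=2, interval=3.0,
                       timeout=300.0, sleep=time.sleep, clock=time.monotonic):
    """Return True once get_status() reports 'up' on `required_consecutive`
    polls in a row; return False if `timeout` seconds pass first.

    get_status is a hypothetical caller-supplied callable returning the
    current host status as a string (for example 'installing' or 'up').
    """
    deadline = clock() + timeout
    consecutive = 0
    while True:
        if get_status() == 'up':
            consecutive += 1
            if consecutive >= required_consecutive:
                return True
        else:
            # Host left Up again (e.g. vdsm restarting): start counting over.
            consecutive = 0
        if clock() >= deadline:
            return False
        sleep(interval)
```

A test such as add_master_storage_domain could call a helper like this before
picking a host. Requiring two consecutive Up readings is only a heuristic to
ride out the status flip; it does not explain or prevent the SIGTERM itself.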
