Milan Zamazal <[email protected]> writes:

> Michal Skrivanek <[email protected]> writes:
>
>>> On 16 Sep 2019, at 10:30, Milan Zamazal <[email protected]> wrote:
>>>
>>> Dusan Fodor <[email protected]> writes:
>>>
>>>> After even more investigation, root of issue seems to lie in vdsm receiving
>>>> SIGTERM in the only host that is in state up [1]:
>>>> *[vds] Received signal 15, shutting down (vdsmd:70)*
>>>
>>> I see, thank you for looking into it and finding the signal. Can you
>>> see in the logs what could cause this? Are Engine fencing attempts
>>> issued before or after this signal? If it is not caused by Engine
>>> fencing, is there anything in the system logs explaining that SIGTERM?
>>
>> unrelated
>>
>>>
>>> Let's take the upcoming OST gating as an opportunity to fix that host
>>> status flipping problem. It must be fixed before OST gating is enabled.
>>
>> it seems rather infra-related to the initOnVdsUp() processing. Best
>> for now would be to wait a little and try again to check the Host
>> status once it’s Up for the first time.
>
> Is there any alternative to waiting? Such as checking that some VDS Up
> event or so appeared twice?
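For illustration only, the "wait a little and check again" idea could be sketched as a small polling helper that treats a host as usable only after it has reported Up on several consecutive checks. This is a sketch under assumptions, not code from OST or Engine: the function name, its parameters, and the `get_status` callable are all hypothetical, and how the status is actually queried from Engine is left out.

```python
import time


def wait_until_stably_up(get_status, required_consecutive=3, interval=1.0,
                         timeout=60.0, sleep=time.sleep,
                         clock=time.monotonic):
    """Return True once get_status() returned 'up' on
    `required_consecutive` polls in a row, False if `timeout` expires first.

    get_status is a hypothetical callable supplied by the test; this
    sketch makes no assumption about how it talks to Engine.
    """
    deadline = clock() + timeout
    streak = 0
    while True:
        if get_status() == 'up':
            streak += 1
            if streak >= required_consecutive:
                return True
        else:
            # The host left the Up state (e.g. vdsm restarted): start over.
            streak = 0
        if clock() >= deadline:
            return False
        sleep(interval)
```

A test would call this before picking a host for add_master_storage_domain. It only narrows the race window: a host can still drop out of Up after the last successful poll.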
Is anybody working on any fix of the failure?

>> Thanks,
>> michal
>>
>>>
>>>> while the other host is still in status Installing (so it cannot be used
>>>> for fencing, hence the fence action failure).
>>>> The vdsm then goes back up in a few moments, but engine, expecting the host
>>>> to be up all the time, meanwhile fails doing an operation that requires the
>>>> host to be up.
>>>>
>>>> [1]
>>>> https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-host-0/_var_log/vdsm/vdsm.log
>>>>
>>>> On Fri, Sep 13, 2019 at 5:18 PM Dusan Fodor <[email protected]> wrote:
>>>>
>>>>> For brave investigators, a similar issue in a later stage of the same test
>>>>> can be found here [1]. Same symptom of fence action failure, but this time
>>>>> it causes a failure when adding the storage itself:
>>>>> *2019-09-12 09:53:32,571-04 ERROR
>>>>> [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default
>>>>> task-1) [] Operation Failed: [Cannot attach Storage. There is no active
>>>>> Host in the Data Center.]*
>>>>>
>>>>> [1] https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15821
>>>>>
>>>>> On Fri, Sep 13, 2019 at 5:09 PM Dusan Fodor <[email protected]> wrote:
>>>>>
>>>>>> Hello all,
>>>>>> lately I witnessed multiple failures of the add_master_storage_domain test
>>>>>> which were not related to the changes themselves, nor to any infra issue.
>>>>>> One example can be found here [1].
>>>>>> After investigation with huge help of Milan, the issue is that the Host
>>>>>> suddenly falls from up state to whatever-but-not-up.
>>>>>>
>>>>>>
>>>>>>    1. add_storage_domain picks a random host that is in up state
>>>>>>    2. meanwhile engine starts a fence action for it, so probably something
>>>>>>    has gone bad with the host; the fence action fails with:
>>>>>>    *[org.ovirt.engine.core.bll.pm.FenceProxyLocator]
>>>>>>    (EE-ManagedThreadFactory-engineScheduled-Thread-38) [6692895f] Can not
>>>>>>    run fence action on host 'lago-basic-suite-master-host-0', no suitable
>>>>>>    proxy host was found.*
>>>>>>    3. test fails on not being able to attach the domain to a non-up
>>>>>>    host:
>>>>>>    *[org.ovirt.engine.api.restapi.resource.AbstractBackendResource]
>>>>>>    (default task-1) [] Operation Failed: [Cannot add storage server
>>>>>>    connection when Host status is not up]*
>>>>>>
>>>>>> For better orientation in the failed job's engine log [2], the fence action
>>>>>> for the host fails at
>>>>>> :46:12,842-04
>>>>>> and engine learns it cannot connect storage to the host at
>>>>>> :46:16,105-04
>>>>>>
>>>>>> The test itself, add_master_storage_domain, starts at ~ :46:13,753
>>>>>> (according to lago log).
>>>>>>
>>>>>> Could you please check this?
>>>>>> Thanks
>>>>>>
>>>>>> [1] https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829
>>>>>> [2]
>>>>>> https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829/artifact/basic-suite.el7.x86_64/test_logs/basic-suite-master/post-002_bootstrap.py/lago-basic-suite-master-engine/_var_log/ovirt-engine/engine.log
>>>>>>
>>>>>>
>>>> _______________________________________________
>>>> Devel mailing list -- [email protected]
>>>> To unsubscribe send an email to [email protected]
>>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>>> oVirt Code of Conduct:
>>>> https://www.ovirt.org/community/about/community-guidelines/
>>>> List Archives:
>>>> https://lists.ovirt.org/archives/list/[email protected]/message/MMH7DGCH24G7VVBGHXEFT3AKKJP726PL/
>>> _______________________________________________
>>> Devel mailing list -- [email protected]
>>> To unsubscribe send an email to [email protected]
>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>> oVirt Code of Conduct:
>>> https://www.ovirt.org/community/about/community-guidelines/
>>> List Archives:
>>> https://lists.ovirt.org/archives/list/[email protected]/message/KQY5JULWUDTJPQNQ4L6UDR4JSDIZS6IO/
_______________________________________________
Devel mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/O6TVKPXT2C3ASWNCDBP7LQEORJ5WIOUM/
