On Wed, Jul 7, 2021 at 4:42 PM Eitan Raviv <[email protected]> wrote:
>
> Adding Ales as well.
> AFAIK vdsm does not actively poll engine for liveness, nor does it do any
> retries. But retries might happen at a deeper infra level, where Marcin is
> the person to ask, IIUC.
Right, there are no retries in vdsm: we send replies or events, and we have
no way to tell whether the engine got the message.

> On Wed, Jul 7, 2021 at 4:40 PM Yedidyah Bar David <[email protected]> wrote:
>>
>> On Wed, Jun 23, 2021 at 12:30 PM Yedidyah Bar David <[email protected]> wrote:
>> >
>> > On Wed, Jun 23, 2021 at 12:02 PM Sandro Bonazzola <[email protected]> wrote:
>> >>
>> >> On Wed, Jun 23, 2021 at 07:48 Yedidyah Bar David <[email protected]> wrote:
>> >>>
>> >>> On Wed, Jun 9, 2021 at 12:13 PM Yedidyah Bar David <[email protected]> wrote:
>> >>> >
>> >>> > On Tue, Jun 8, 2021 at 9:01 AM Yedidyah Bar David <[email protected]> wrote:
>> >>> > >
>> >>> > > On Tue, Jun 8, 2021 at 6:08 AM <[email protected]> wrote:
>> >>> > > >
>> >>> > > > Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/
>> >>> > > > Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/
>> >>> > > > Build Number: 2046
>> >>> > > > Build Status: Failure
>> >>> > > > Triggered By: Started by timer
>> >>> > > >
>> >>> > > > -------------------------------------
>> >>> > > > Changes Since Last Success:
>> >>> > > > -------------------------------------
>> >>> > > > Changes for Build #2046
>> >>> > > > [Eitan Raviv] network: force select spm - wait for dc status
>> >>> > > >
>> >>> > > > -----------------
>> >>> > > > Failed Tests:
>> >>> > > > -----------------
>> >>> > > > 1 tests failed.
>> >>> > > > FAILED:
>> >>> > > > he-basic-suite-master.test-scenarios.test_004_basic_sanity.test_template_export
>> >>> > > >
>> >>> > > > Error Message:
>> >>> > > > ovirtsdk4.Error: Failed to read response: [(<pycurl.Curl object at
>> >>> > > > 0x5624fe64d108>, 7, 'Failed to connect to 192.168.200.99 port 443:
>> >>> > > > Connection refused')]
>> >>> > >
>> >>> > > - The engine VM went down:
>> >>> > >
>> >>> > > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/artifact/exported-artifacts/test_logs/he-basic-suite-master/lago-he-basic-suite-master-host-0/_var_log/ovirt-hosted-engine-ha/agent.log
>> >>> > >
>> >>> > > MainThread::INFO::2021-06-08 05:07:34,414::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)
>> >>> > > MainThread::INFO::2021-06-08 05:07:44,575::states::135::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 960 due to network status
>> >>> > >
>> >>> > > - Because HA monitoring failed to get a reply from the DNS server:
>> >>> > >
>> >>> > > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/artifact/exported-artifacts/test_logs/he-basic-suite-master/lago-he-basic-suite-master-host-0/_var_log/ovirt-hosted-engine-ha/broker.log
>> >>> > >
>> >>> > > Thread-1::WARNING::2021-06-08 05:07:25,486::network::120::network.Network::(_dns) DNS query failed:
>> >>> > > ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
>> >>> > > ;; global options: +cmd
>> >>> > > ;; connection timed out; no servers could be reached
>> >>> > >
>> >>> > > Thread-3::INFO::2021-06-08 05:07:28,543::mem_free::51::mem_free.MemFree::(action) memFree: 1801
>> >>> > > Thread-5::INFO::2021-06-08 05:07:28,972::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
>> >>> > > Thread-2::INFO::2021-06-08 05:07:31,532::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt in up state
>> >>> > > Thread-1::WARNING::2021-06-08 05:07:33,011::network::120::network.Network::(_dns) DNS query failed:
>> >>> > > ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
>> >>> > > ;; global options: +cmd
>> >>> > > ;; connection timed out; no servers could be reached
>> >>> > >
>> >>> > > Thread-4::INFO::2021-06-08 05:07:37,433::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.3196, engine=0.1724, non-engine=0.1472
>> >>> > > Thread-3::INFO::2021-06-08 05:07:37,839::mem_free::51::mem_free.MemFree::(action) memFree: 1735
>> >>> > > Thread-5::INFO::2021-06-08 05:07:39,146::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
>> >>> > > Thread-1::WARNING::2021-06-08 05:07:40,535::network::120::network.Network::(_dns) DNS query failed:
>> >>> > > ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
>> >>> > > ;; global options: +cmd
>> >>> > > ;; connection timed out; no servers could be reached
>> >>> > >
>> >>> > > Thread-1::WARNING::2021-06-08 05:07:40,535::network::92::network.Network::(action) Failed to verify network status, (2 out of 5)
>> >>> > >
>> >>> > > - Not sure why.
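For what it's worth, the "Penalizing score by 960" line in agent.log and the "(2 out of 5)" count in broker.log are numerically consistent. A minimal sketch, assuming (this is a guess at the formula, not HA's actual code) that the agent penalizes the score by the failed fraction of some weight W over the 5-probe window; W=1600 happens to reproduce the observed 960:

```shell
# Hypothetical reconstruction, NOT ovirt-hosted-engine-ha's actual code:
# assume the penalty is the failed fraction of a weight W over a window
# of 5 probes. W=1600 matches "Penalizing score by 960" for "(2 out of
# 5)" above, but both W and the formula are assumptions.
network_penalty() {
    successes=$1
    window=5
    weight=1600
    echo $(( weight * (window - successes) / window ))
}

network_penalty 2    # prints 960, matching the agent.log line above
```

If the formula is roughly right, even a single extra successful probe (3 out of 5) would have cut the penalty to 640 and kept the score well above zero.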
>> >>> > > DNS servers:
>> >>> > >
>> >>> > > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/artifact/exported-artifacts/test_logs/he-basic-suite-master/lago-he-basic-suite-master-host-0/_etc_resolv.conf
>> >>> > >
>> >>> > > # Generated by NetworkManager
>> >>> > > search lago.local
>> >>> > > nameserver 192.168.200.1
>> >>> > > nameserver fe80::5054:ff:fe0c:9ad0%ovirtmgmt
>> >>> > > nameserver fd8f:1391:3a82:200::1
>> >>>
>> >>> Now happened again:
>> >>>
>> >>> https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/17434/
>> >>>
>> >>> https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/17434/artifact/check-patch.he-basic_suite_master.el8.x86_64/test_logs/lago-he-basic-suite-master-host-0/var/log/ovirt-hosted-engine-ha/broker.log
>> >>>
>> >>> Thread-1::INFO::2021-06-22 18:57:29,134::network::88::network.Network::(action) Successfully verified network status
>> >>> ...
>> >>> Thread-1::WARNING::2021-06-22 18:58:13,390::network::92::network.Network::(action) Failed to verify network status, (0 out of 5)
>> >>> Thread-1::INFO::2021-06-22 18:58:15,761::network::88::network.Network::(action) Successfully verified network status
>> >>> ...
>> >>> > >
>> >>> > > - The command we run is 'dig +tries=1 +time=5', which defaults to
>> >>> > > querying for '.' (the DNS root). This is normally cached locally, but
>> >>> > > it has a TTL of 86400, meaning it can be cached for up to one day. So if
>> >>> > > we ran this query right after it expired, _and_ the local DNS server
>> >>> > > then had trouble forwarding our request (due to external issues,
>> >>> > > perhaps), it would fail like this. I am going to ignore this
>> >>> > > failure for now, assuming it was temporary, but it might be worth
>> >>> > > opening an RFE on ovirt-hosted-engine-ha asking for some more
>> >>> > > flexibility - setting the query string or something similar.
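Such an RFE could amount to a small wrapper like the one below - a sketch only, with the function name, defaults, and retry count all made up for illustration; the only part taken from the thread is the 'dig +tries=1 +time=5' probe itself:

```shell
# Hypothetical sketch of a more flexible DNS probe: let the caller set
# the query target instead of hardcoding the root ".", and retry a few
# times before declaring failure. Names and defaults are made up.
probe_dns() {
    target=${1:-.}     # what to query; HA currently queries the root
    attempts=${2:-3}   # probes before giving up; HA currently does 1
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if dig +tries=1 +time=5 "$target" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}
```

Querying a name the deployment actually depends on (the engine FQDN, say) would also make the probe reflect what the hosts really need from DNS, rather than the cached root.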
>> >>> > > I think
>> >>> > > this bug is probably quite hard to reproduce, because normally
>> >>> > > all hosts use the same DNS server, and problems with it
>> >>> > > affect all of them similarly.
>> >>> > >
>> >>> > > - Anyway, it seems like there were temporary connectivity issues on
>> >>> > > the network there. A minute later:
>> >>> > >
>> >>> > > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/artifact/exported-artifacts/test_logs/he-basic-suite-master/lago-he-basic-suite-master-host-0/_var_log/ovirt-hosted-engine-ha/broker.log
>> >>> > >
>> >>> > > Thread-1::INFO::2021-06-08 05:08:08,143::network::88::network.Network::(action) Successfully verified network status
>> >>> > >
>> >>> > > But that was too late: the engine VM was already on its way down.
>> >>> > >
>> >>> > > A remaining open question is whether we should retry before giving up,
>> >>> > > and where - in the SDK, in OST code, etc. - or whether this should be
>> >>> > > considered normal.
>> >>>
>> >>> What do you think?
>> >>
>> >> The question is: is a retry in place on the vdsm side as well? Because if
>> >> it fails on vdsm, it's better to fail here as well. If there's a retry
>> >> process in vdsm for all network calls, I think we can relax the check
>> >> here and retry before giving up.
>> >
>> > No idea; adding Eitan.
>>
>> I talked with Eitan about this in private, and he'll check. Thanks.
>>
>> This has been happening more often recently.
>>
>> I pushed a patch [1] to test the network alongside OST, and one of the
>> CI check-patch runs for it also failed for this reason [2] (check
>> broker.log on host-0). The log generated by this patch [3] ends with
>> "Passed 1311 out of 1338", meaning it lost 27 replies in less than an
>> hour, which IMO is quite a lot. The latest version of the patch tries
>> dig with '+tcp' - if that's enough to make it pass with (close to)
>> zero losses, perhaps we can do the same in HA.
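The counting behind a log like run_dig_loop.log [3] can be sketched roughly as follows; the loop structure is a guess at what the patch does, and only the '+tcp' probe and the "Passed N out of M" summary format come from the thread:

```shell
# Rough sketch of a dig loop producing "Passed N out of M" output like
# run_dig_loop.log [3]. The structure is a guess; '+tcp' and the summary
# format are from the thread. '+tcp' sends the query over TCP, which
# retransmits, instead of a single UDP datagram that can be silently lost.
dig_loop() {
    total=$1
    passed=0
    i=0
    while [ "$i" -lt "$total" ]; do
        i=$((i + 1))
        if dig +tcp +tries=1 +time=5 . >/dev/null 2>&1; then
            passed=$((passed + 1))
        fi
    done
    echo "Passed $passed out of $total"
}
```

If the TCP variant really does pass with near-zero losses, that would point at UDP loss (or a flaky forwarder) on the lago network rather than at the DNS server itself.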
>>
>> Thanks and best regards,
>>
>> [1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/115586
>> [2] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/17711/
>> [3] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/17711/artifact/check-patch.he-basic_suite_master.el8.x86_64/test_logs/lago-he-basic-suite-master-host-0/var/log/run_dig_loop.log
>> --
>> Didi
>>
> _______________________________________________
> Infra mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> Privacy Statement: https://www.ovirt.org/privacy-policy.html
> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
> List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/64EOPWAPNDKTMA5JFUDII7D2SOTT3TY6/

_______________________________________________
Infra mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/N3TDW75A5OAKFUEAGYAWOYEOTFJTJS5Q/
