On Tue, Jun 8, 2021 at 9:01 AM Yedidyah Bar David <[email protected]> wrote:
>
> On Tue, Jun 8, 2021 at 6:08 AM <[email protected]> wrote:
> >
> > Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/
> > Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/
> > Build Number: 2046
> > Build Status: Failure
> > Triggered By: Started by timer
> >
> > -------------------------------------
> > Changes Since Last Success:
> > -------------------------------------
> > Changes for Build #2046
> > [Eitan Raviv] network: force select spm - wait for dc status
> >
> > -----------------
> > Failed Tests:
> > -----------------
> > 1 tests failed.
> > FAILED: he-basic-suite-master.test-scenarios.test_004_basic_sanity.test_template_export
> >
> > Error Message:
> > ovirtsdk4.Error: Failed to read response: [(<pycurl.Curl object at 0x5624fe64d108>, 7, 'Failed to connect to 192.168.200.99 port 443: Connection refused')]
>
> - The engine VM went down:
>
> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/artifact/exported-artifacts/test_logs/he-basic-suite-master/lago-he-basic-suite-master-host-0/_var_log/ovirt-hosted-engine-ha/agent.log
>
> MainThread::INFO::2021-06-08 05:07:34,414::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)
> MainThread::INFO::2021-06-08 05:07:44,575::states::135::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 960 due to network status
>
> - Because HA monitoring failed to get a reply from the dns server:
>
> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/artifact/exported-artifacts/test_logs/he-basic-suite-master/lago-he-basic-suite-master-host-0/_var_log/ovirt-hosted-engine-ha/broker.log
>
> Thread-1::WARNING::2021-06-08 05:07:25,486::network::120::network.Network::(_dns) DNS query
> failed:
> ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
> ;; global options: +cmd
> ;; connection timed out; no servers could be reached
>
> Thread-3::INFO::2021-06-08 05:07:28,543::mem_free::51::mem_free.MemFree::(action) memFree: 1801
> Thread-5::INFO::2021-06-08 05:07:28,972::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
> Thread-2::INFO::2021-06-08 05:07:31,532::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt in up state
> Thread-1::WARNING::2021-06-08 05:07:33,011::network::120::network.Network::(_dns) DNS query failed:
> ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
> ;; global options: +cmd
> ;; connection timed out; no servers could be reached
>
> Thread-4::INFO::2021-06-08 05:07:37,433::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.3196, engine=0.1724, non-engine=0.1472
> Thread-3::INFO::2021-06-08 05:07:37,839::mem_free::51::mem_free.MemFree::(action) memFree: 1735
> Thread-5::INFO::2021-06-08 05:07:39,146::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
> Thread-1::WARNING::2021-06-08 05:07:40,535::network::120::network.Network::(_dns) DNS query failed:
> ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
> ;; global options: +cmd
> ;; connection timed out; no servers could be reached
>
> Thread-1::WARNING::2021-06-08 05:07:40,535::network::92::network.Network::(action) Failed to verify network status, (2 out of 5)
>
> - Not sure why.
> DNS servers:
>
> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/artifact/exported-artifacts/test_logs/he-basic-suite-master/lago-he-basic-suite-master-host-0/_etc_resolv.conf
>
> # Generated by NetworkManager
> search lago.local
> nameserver 192.168.200.1
> nameserver fe80::5054:ff:fe0c:9ad0%ovirtmgmt
> nameserver fd8f:1391:3a82:200::1
>
> - The command we run is 'dig +tries=1 +time=5', which defaults to
> querying for '.' (the DNS root). This is normally cached locally, but
> it has a TTL of 86400, meaning it can be cached for up to one day. So
> if we ran this query right after the cached entry expired, _and_ the
> local DNS server then had some issue forwarding our request (due to
> external problems, perhaps), it would fail like this. I am going to
> ignore this failure for now, assuming it was temporary, but it might
> be worth opening an RFE on ovirt-hosted-engine-ha asking for some more
> flexibility - setting the query string or something similar. I think
> this bug is probably quite hard to reproduce, because normally all
> hosts use the same DNS server, and problems with it affect all of
> them similarly.
>
> - Anyway, it seems like there were temporary connectivity issues on
> the network there. A minute later:
>
> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/artifact/exported-artifacts/test_logs/he-basic-suite-master/lago-he-basic-suite-master-host-0/_var_log/ovirt-hosted-engine-ha/broker.log
>
> Thread-1::INFO::2021-06-08 05:08:08,143::network::88::network.Network::(action) Successfully verified network status
>
> But that was too late: the engine VM was already on its way down.
>
> A remaining open question is whether we should retry before giving up,
> and where - in the SDK, in OST code, etc. - or whether this should be
> considered normal.
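The retry question above could be answered with a small wrapper around the failing SDK call. This is only a sketch of the idea, not code from the SDK or OST; the function name, defaults, and the choice of which exceptions to retry are mine. Real code would catch ovirtsdk4.Error (and probably log each attempt) rather than the generic exception types used here:

```python
import time


def retry_call(func, attempts=3, delay=5, retry_on=(OSError,)):
    """Call func(), retrying up to `attempts` times on the exception
    types in `retry_on`, sleeping `delay` seconds between tries.
    Re-raises the last exception if all attempts fail."""
    for i in range(attempts):
        try:
            return func()
        except retry_on:
            if i == attempts - 1:
                raise
            time.sleep(delay)


# Hypothetical usage in a test: wrap the export call so a transient
# "Connection refused" during an engine VM restart does not fail the run.
# retry_call(lambda: template.export(...), attempts=5, delay=30)
```

Whether this belongs in the SDK, in OST helpers, or nowhere at all depends on whether a briefly unreachable engine should count as a failure, which is exactly the open question.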
This now happened again [1] (with [2], for testing [3], but I don't
think that's related):

https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2049/artifact/exported-artifacts/test_logs/he-basic-suite-master/lago-he-basic-suite-master-host-0/_var_log/ovirt-hosted-engine-ha/broker.log

Sparing you the dns lines (you can search the log for details), but it
happened twice in a few minutes. The first one was non-fatal, as it was
"resolved" quickly:

Thread-1::WARNING::2021-06-09 10:46:28,504::network::92::network.Network::(action) Failed to verify network status, (4 out of 5)
Thread-1::INFO::2021-06-09 10:46:31,737::network::88::network.Network::(action) Successfully verified network status

The second was "fatal" - it caused the score to become low and the
agent to stop the VM:

Thread-1::WARNING::2021-06-09 10:50:26,809::network::120::network.Network::(_dns) DNS query failed:
...
Thread-1::WARNING::2021-06-09 10:51:06,090::network::92::network.Network::(action) Failed to verify network status, (4 out of 5)

Then it did resolve, but this was too late:

Thread-1::INFO::2021-06-09 10:51:09,292::network::88::network.Network::(action) Successfully verified network status

So the network wasn't completely dead (4 of 5 checks failed, and it got
better in less than a minute), but it was bad enough.

[1] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2049/
[2] https://jenkins.ovirt.org/job/oVirt_ovirt-ansible-collection_standard-check-pr/685/
[3] https://github.com/oVirt/ovirt-ansible-collection/pull/277

Best regards,
--
Didi
_______________________________________________
Infra mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/GZBKDWDN4572EMZ75UG5YVLRYVKTKG7F/
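P.S. For anyone puzzling over the "(N out of 5)" lines: my reading is that the broker keeps a short window of recent network-check results and reports how many of them failed. The sketch below mimics that counting so the log lines are easier to interpret; it is my interpretation of the log format, not the actual ovirt-hosted-engine-ha code, and the class name and the threshold of 4 are mine:

```python
from collections import deque


class NetworkCheckWindow:
    """Track the last `size` network-check results and count failures,
    mimicking the broker's "Failed to verify network status, (N out of
    5)" lines. Interpretation of the logs, not the real broker code."""

    def __init__(self, size=5):
        self.results = deque(maxlen=size)  # True = check passed

    def record(self, ok):
        self.results.append(ok)

    def failures(self):
        return sum(1 for ok in self.results if not ok)

    def is_bad(self, threshold=4):
        # Assumed threshold: 4+ failures in the window is "bad enough",
        # matching the fatal "(4 out of 5)" episode above.
        return self.failures() >= threshold
```

On this reading, both episodes hit "(4 out of 5)"; the difference was only how long the window stayed bad before a success was recorded again.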
