> On 4. 8. 2021, at 7:38, Yedidyah Bar David <d...@redhat.com> wrote: > > On Tue, Aug 3, 2021 at 10:27 PM Michal Skrivanek <michal.skriva...@redhat.com > <mailto:michal.skriva...@redhat.com>> wrote: > > >> On 3. 8. 2021, at 11:43, Yedidyah Bar David <d...@redhat.com >> <mailto:d...@redhat.com>> wrote: >> >> On Tue, Aug 3, 2021 at 10:05 AM Yedidyah Bar David <d...@redhat.com >> <mailto:d...@redhat.com>> wrote: >>> >>> On Tue, Aug 3, 2021 at 7:50 AM <jenk...@jenkins.phx.ovirt.org >>> <mailto:jenk...@jenkins.phx.ovirt.org>> wrote: >>>> >>>> Project: >>>> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/ >>>> <https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/> >>>> Build: >>>> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2126/ >>>> >>>> <https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2126/> >>>> Build Number: 2126 >>>> Build Status: Failure >>>> Triggered By: Started by timer >>>> >>>> ------------------------------------- >>>> Changes Since Last Success: >>>> ------------------------------------- >>>> Changes for Build #2126 >>>> [Michal Skrivanek] basic: skipping just the VNC console part of >>>> test_virtual_machines >>>> >>>> >>>> >>>> >>>> ----------------- >>>> Failed Tests: >>>> ----------------- >>>> 2 tests failed. >>>> FAILED: >>>> he-basic-suite-master.test-scenarios.test_012_local_maintenance_sdk.test_local_maintenance >>>> >>>> Error Message: >>>> ovirtsdk4.Error: Failed to read response: [(<pycurl.Curl object at >>>> 0x5555faf11228>, 7, 'Failed to connect to 192.168.200.99 port 443: >>>> Connection refused')] >>> >>> This looks very similar to the issue we have with dns/dig failures >>> that cause the engine VM to go down, and it's similar, but different. >>> >>> dig didn't fail (it now uses TCP), but something else caused the agent >>> to stop the engine VM - a combination of high cpu load and low free >>> memory, after restarting the engine VM as part of test_008. >>> >>> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2126/artifact/exported-artifacts/test_logs/ost-he-basic-suite-master-host-0/var/log/ovirt-hosted-engine-ha/agent.log >>> >>> <https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2126/artifact/exported-artifacts/test_logs/ost-he-basic-suite-master-host-0/var/log/ovirt-hosted-engine-ha/agent.log> >>> : >>> >>> ===================================================================================== >>> MainThread::INFO::2021-08-03 >>> 06:46:55,068::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) >>> Current state ReinitializeFSM (score: 0) >>> MainThread::INFO::2021-08-03 >>> 06:47:04,089::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) >>> Global maintenance detected >>> MainThread::INFO::2021-08-03 >>> 06:47:04,169::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) >>> Success, was notification of state_transition >>> (ReinitializeFSM-GlobalMaintenance) sent? ignored >>> MainThread::INFO::2021-08-03 >>> 06:47:05,249::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) >>> Current state GlobalMaintenance (score: 3400) >>> MainThread::INFO::2021-08-03 >>> 06:47:14,439::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) >>> Global maintenance detected >>> MainThread::INFO::2021-08-03 >>> 06:47:25,526::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) >>> Penalizing score by 814 due to cpu load >>> MainThread::INFO::2021-08-03 >>> 06:47:25,527::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) >>> Current state GlobalMaintenance (score: 2586) >>> MainThread::INFO::2021-08-03 >>> 06:47:25,537::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) >>> Global maintenance detected >>> MainThread::INFO::2021-08-03 >>> 06:47:26,029::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) >>> Current state GlobalMaintenance (score: 2586) >>> MainThread::INFO::2021-08-03 >>> 06:47:35,050::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) >>> Global maintenance detected >>> MainThread::INFO::2021-08-03 >>> 06:47:35,576::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) >>> Current state GlobalMaintenance (score: 2586) >>> MainThread::INFO::2021-08-03 >>> 06:47:45,597::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) >>> Global maintenance detected >>> MainThread::INFO::2021-08-03 >>> 06:47:46,521::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) >>> Current state GlobalMaintenance (score: 2586) >>> MainThread::INFO::2021-08-03 >>> 06:47:55,577::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) >>> Global maintenance detected >>> MainThread::INFO::2021-08-03 >>> 06:47:56,559::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) >>> Current state GlobalMaintenance (score: 2586) >>> MainThread::INFO::2021-08-03 >>> 06:47:56,559::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) >>> Best remote host ost-he-basic-suite-master-host-1 (id: 2, score: 3400) >>> MainThread::INFO::2021-08-03 >>> 06:48:05,633::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) >>> Global maintenance detected >>> MainThread::INFO::2021-08-03 >>> 06:48:06,188::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) >>> Penalizing score by 820 due to cpu load >>> MainThread::INFO::2021-08-03 >>> 06:48:06,188::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) >>> Current state GlobalMaintenance (score: 2580) >>> MainThread::INFO::2021-08-03 >>> 06:48:16,256::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) >>> Global maintenance detected >>> MainThread::INFO::2021-08-03 >>> 06:48:16,950::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) >>> Penalizing score by 831 due to cpu load >>> MainThread::INFO::2021-08-03 >>> 06:48:16,951::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) >>> Current state GlobalMaintenance (score: 2569) >>> MainThread::INFO::2021-08-03 >>> 06:48:26,053::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) >>> Global maintenance detected >>> MainThread::INFO::2021-08-03 >>> 06:48:26,999::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) >>> Penalizing score by 839 due to cpu load >>> MainThread::INFO::2021-08-03 >>> 06:48:26,999::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) >>> Current state GlobalMaintenance (score: 2561) >>> MainThread::INFO::2021-08-03 >>> 06:48:36,026::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) >>> Global maintenance detected >>> MainThread::INFO::2021-08-03 >>> 06:48:36,802::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) >>> Penalizing score by 844 due to cpu load >>> MainThread::INFO::2021-08-03 >>> 06:48:36,802::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) >>> Current state GlobalMaintenance (score: 2556) >>> MainThread::INFO::2021-08-03 >>> 06:48:45,827::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) >>> Global maintenance detected >>> MainThread::INFO::2021-08-03 >>> 06:48:46,401::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) >>> Penalizing score by 849 due to cpu load >>> MainThread::INFO::2021-08-03 >>> 06:48:46,401::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) >>> Current state GlobalMaintenance (score: 2551) >>> MainThread::INFO::2021-08-03 >>> 06:48:56,588::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) >>> Success, was notification of state_transition >>> (GlobalMaintenance-ReinitializeFSM) sent? ignored >>> MainThread::INFO::2021-08-03 >>> 06:48:58,685::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) >>> Current state ReinitializeFSM (score: 0) >>> MainThread::INFO::2021-08-03 >>> 06:49:05,729::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) >>> Success, was notification of state_transition >>> (ReinitializeFSM-EngineStarting) sent? ignored >>> MainThread::INFO::2021-08-03 >>> 06:49:06,146::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) >>> Penalizing score by 830 due to cpu load >>> MainThread::INFO::2021-08-03 >>> 06:49:06,146::states::72::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_penalize_memory) >>> Penalizing score by 400 due to free memory 1782 being lower than >>> required 3171 >>> MainThread::INFO::2021-08-03 >>> 06:49:06,146::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) >>> Current state EngineStarting (score: 2170) >>> MainThread::INFO::2021-08-03 >>> 06:49:06,150::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) >>> Timeout cleared while transitioning <class >>> 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class >>> 'ovirt_hosted_engine_ha.agent.states.EngineUp'> >>> MainThread::INFO::2021-08-03 >>> 06:49:06,172::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) >>> Success, was notification of state_transition >>> (EngineStarting-EngineUp) sent? ignored >>> MainThread::INFO::2021-08-03 >>> 06:49:06,178::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) >>> Current state EngineUp (score: 2570) >>> MainThread::ERROR::2021-08-03 >>> 06:49:16,197::states::398::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) >>> Host ost-he-basic-suite-master-host-1 (id 2) score is significantly >>> better than local score, shutting down VM on this host >>> ===================================================================================== >>> >>> I looked a bit at /var/log/messages of the host, and while there is >>> quite some noise there, can't tell specifically what might have caused >>> the high cpu load. >>> >>> Also looked at logs of previous runs, and there is indeed a similar >>> pattern, where the cpuload on the host causes the agent to penalize >>> the score, but according to the agent log the load goes down faster, >>> up to the point of being quite low when the engine is up, and only >>> then we exit global maintenance, allowing the agent to take actions. >>> >>> I now ran it again, but think that otherwise I'll ignore this for now, >>> unless we see more similar failures. If we do, we might want to >>> check/monitor/log the cpu load on the hosts, and/or change test_008 to >>> wait, after the engine is up, until the cpu load on the host goes down >>> a bit. Before the port to pytest we waited a hard-coded 5 minutes and >>> I >>> changed that then to only wait until the engine VM is not migrating, >>> and this worked more-or-less ok. We might need to refine that, >>> although I'd rather not introduce another arbitrary long delays but >>> wait for some condition. >>> >>> I also noticed that this ran was started at "03:04 AM" (UTC), by >>> timer, whereas all previous timer-based runs started at 01:55 AM, >>> somewhat earlier - perhaps there are some other things that run at >>> that time that cause load. >> >> Checked some more and found another difference: The total cpu use was >> not very different between the failed and the last successful runs, >> but load caused by the engine vm was different between them - in the >> failed run, engine kept being '0.0' for a longer time, compared with >> the successful run. Since the agent tries to ignore the load by the >> engine vm, the final result is significantly different. The load-by >> -engine-vm is calculated based on the result of a VDSM getStats call >> on the engine VM. Indeed, in vdsm.log, cpuUser and cpuSys are both >> starting with '0.00' for both runs, but the failed one is '0.00' for >> a longer time - 45 seconds between VM.create and first-non-zero >> cpuUser on the successful run, vs 93 seconds for the failed one. >> Does this make sense (from VDSM's POV)? If so, perhaps there is >> a way to get from VDSM the answer to this question: "Is this 0.00 >> CPU usage a result of still-initial, non-accurate, non-complete >> monitoring? Or is it really not using the CPU at all?" and use it >> in the agent somehow. > > I think it’s futile to try to understand that on jenkins.ovirt.org > <http://jenkins.ovirt.org/> runs. It’s too unpredictable since it runs in a > shared environment, you can’t know what’s going on on the same physical host. > > I understand that I can't get/expect stable performance/load there. That's > not my question. I asked about VDSM. I don't know the code there well. I > assume that when it starts a VM, it initializes something somewhere with > 'cpuUser: 0.00', and once its gets _some_ data, whatever that might be, it > starts reporting (based on) this data. My question was if there is a reliable > way to know if '0.00' is just real data, or initial-not-yet-real data.
I don’t really know for sure, but AFAICT it should be real data from the start. Maybe for the first interval, but afterwards it’s always a libvirt reported value > One of the reasons we try to move away from it to beaker-based runs that run > dedicated. > The problem is you have no metrics for the actual host, so the numbers OST or > vdsm see can be very misleading > We do have sar monitoring of OST VMs, we do not capture anything from the > host (the host running OST code), but I would guess that even if we would, > then for mock env it’s probably going to be misleading anyway. > > Again: Perhaps the actual reported load would be misleading, because it's on > a VM, which does not own its CPUs. I'd still expect '0.00' to mean that, or > have a way to know that it actually means "I don't know yet". > > Best regards, > -- > Didi
_______________________________________________ Devel mailing list -- devel@ovirt.org To unsubscribe send an email to devel-le...@ovirt.org Privacy Statement: https://www.ovirt.org/privacy-policy.html oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ List Archives: https://lists.ovirt.org/archives/list/devel@ovirt.org/message/M3DU3LJ42Q3ZFZEKMYTUPSU3UI6JCC6M/