> On 4. 8. 2021, at 7:38, Yedidyah Bar David <d...@redhat.com> wrote:
> 
> On Tue, Aug 3, 2021 at 10:27 PM Michal Skrivanek <michal.skriva...@redhat.com> wrote:
> 
> 
>> On 3. 8. 2021, at 11:43, Yedidyah Bar David <d...@redhat.com> wrote:
>> 
>> On Tue, Aug 3, 2021 at 10:05 AM Yedidyah Bar David <d...@redhat.com> wrote:
>>> 
>>> On Tue, Aug 3, 2021 at 7:50 AM <jenk...@jenkins.phx.ovirt.org> wrote:
>>>> 
>>>> Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/
>>>> Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2126/
>>>> Build Number: 2126
>>>> Build Status:  Failure
>>>> Triggered By: Started by timer
>>>> 
>>>> -------------------------------------
>>>> Changes Since Last Success:
>>>> -------------------------------------
>>>> Changes for Build #2126
>>>> [Michal Skrivanek] basic: skipping just the VNC console part of 
>>>> test_virtual_machines
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -----------------
>>>> Failed Tests:
>>>> -----------------
>>>> 2 tests failed.
>>>> FAILED:  
>>>> he-basic-suite-master.test-scenarios.test_012_local_maintenance_sdk.test_local_maintenance
>>>> 
>>>> Error Message:
>>>> ovirtsdk4.Error: Failed to read response: [(<pycurl.Curl object at 
>>>> 0x5555faf11228>, 7, 'Failed to connect to 192.168.200.99 port 443: 
>>>> Connection refused')]
>>> 
>>> This looks very similar to the issue we have with dns/dig failures
>>> that cause the engine VM to go down - similar, but not the same.
>>> 
>>> dig didn't fail (it now uses TCP), but something else caused the agent
>>> to stop the engine VM - a combination of high cpu load and low free
>>> memory, after restarting the engine VM as part of test_008.
>>> 
>>> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2126/artifact/exported-artifacts/test_logs/ost-he-basic-suite-master-host-0/var/log/ovirt-hosted-engine-ha/agent.log
>>> :
>>> 
>>> =====================================================================================
>>> MainThread::INFO::2021-08-03
>>> 06:46:55,068::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>>> Current state ReinitializeFSM (score: 0)
>>> MainThread::INFO::2021-08-03
>>> 06:47:04,089::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
>>> Global maintenance detected
>>> MainThread::INFO::2021-08-03
>>> 06:47:04,169::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
>>> Success, was notification of state_transition
>>> (ReinitializeFSM-GlobalMaintenance) sent? ignored
>>> MainThread::INFO::2021-08-03
>>> 06:47:05,249::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>>> Current state GlobalMaintenance (score: 3400)
>>> MainThread::INFO::2021-08-03
>>> 06:47:14,439::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
>>> Global maintenance detected
>>> MainThread::INFO::2021-08-03
>>> 06:47:25,526::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
>>> Penalizing score by 814 due to cpu load
>>> MainThread::INFO::2021-08-03
>>> 06:47:25,527::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>>> Current state GlobalMaintenance (score: 2586)
>>> MainThread::INFO::2021-08-03
>>> 06:47:25,537::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
>>> Global maintenance detected
>>> MainThread::INFO::2021-08-03
>>> 06:47:26,029::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>>> Current state GlobalMaintenance (score: 2586)
>>> MainThread::INFO::2021-08-03
>>> 06:47:35,050::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
>>> Global maintenance detected
>>> MainThread::INFO::2021-08-03
>>> 06:47:35,576::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>>> Current state GlobalMaintenance (score: 2586)
>>> MainThread::INFO::2021-08-03
>>> 06:47:45,597::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
>>> Global maintenance detected
>>> MainThread::INFO::2021-08-03
>>> 06:47:46,521::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>>> Current state GlobalMaintenance (score: 2586)
>>> MainThread::INFO::2021-08-03
>>> 06:47:55,577::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
>>> Global maintenance detected
>>> MainThread::INFO::2021-08-03
>>> 06:47:56,559::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>>> Current state GlobalMaintenance (score: 2586)
>>> MainThread::INFO::2021-08-03
>>> 06:47:56,559::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>>> Best remote host ost-he-basic-suite-master-host-1 (id: 2, score: 3400)
>>> MainThread::INFO::2021-08-03
>>> 06:48:05,633::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
>>> Global maintenance detected
>>> MainThread::INFO::2021-08-03
>>> 06:48:06,188::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
>>> Penalizing score by 820 due to cpu load
>>> MainThread::INFO::2021-08-03
>>> 06:48:06,188::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>>> Current state GlobalMaintenance (score: 2580)
>>> MainThread::INFO::2021-08-03
>>> 06:48:16,256::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
>>> Global maintenance detected
>>> MainThread::INFO::2021-08-03
>>> 06:48:16,950::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
>>> Penalizing score by 831 due to cpu load
>>> MainThread::INFO::2021-08-03
>>> 06:48:16,951::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>>> Current state GlobalMaintenance (score: 2569)
>>> MainThread::INFO::2021-08-03
>>> 06:48:26,053::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
>>> Global maintenance detected
>>> MainThread::INFO::2021-08-03
>>> 06:48:26,999::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
>>> Penalizing score by 839 due to cpu load
>>> MainThread::INFO::2021-08-03
>>> 06:48:26,999::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>>> Current state GlobalMaintenance (score: 2561)
>>> MainThread::INFO::2021-08-03
>>> 06:48:36,026::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
>>> Global maintenance detected
>>> MainThread::INFO::2021-08-03
>>> 06:48:36,802::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
>>> Penalizing score by 844 due to cpu load
>>> MainThread::INFO::2021-08-03
>>> 06:48:36,802::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>>> Current state GlobalMaintenance (score: 2556)
>>> MainThread::INFO::2021-08-03
>>> 06:48:45,827::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
>>> Global maintenance detected
>>> MainThread::INFO::2021-08-03
>>> 06:48:46,401::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
>>> Penalizing score by 849 due to cpu load
>>> MainThread::INFO::2021-08-03
>>> 06:48:46,401::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>>> Current state GlobalMaintenance (score: 2551)
>>> MainThread::INFO::2021-08-03
>>> 06:48:56,588::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
>>> Success, was notification of state_transition
>>> (GlobalMaintenance-ReinitializeFSM) sent? ignored
>>> MainThread::INFO::2021-08-03
>>> 06:48:58,685::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>>> Current state ReinitializeFSM (score: 0)
>>> MainThread::INFO::2021-08-03
>>> 06:49:05,729::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
>>> Success, was notification of state_transition
>>> (ReinitializeFSM-EngineStarting) sent? ignored
>>> MainThread::INFO::2021-08-03
>>> 06:49:06,146::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
>>> Penalizing score by 830 due to cpu load
>>> MainThread::INFO::2021-08-03
>>> 06:49:06,146::states::72::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_penalize_memory)
>>> Penalizing score by 400 due to free memory 1782 being lower than
>>> required 3171
>>> MainThread::INFO::2021-08-03
>>> 06:49:06,146::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>>> Current state EngineStarting (score: 2170)
>>> MainThread::INFO::2021-08-03
>>> 06:49:06,150::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
>>> Timeout cleared while transitioning <class
>>> 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class
>>> 'ovirt_hosted_engine_ha.agent.states.EngineUp'>
>>> MainThread::INFO::2021-08-03
>>> 06:49:06,172::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
>>> Success, was notification of state_transition
>>> (EngineStarting-EngineUp) sent? ignored
>>> MainThread::INFO::2021-08-03
>>> 06:49:06,178::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
>>> Current state EngineUp (score: 2570)
>>> MainThread::ERROR::2021-08-03
>>> 06:49:16,197::states::398::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
>>> Host ost-he-basic-suite-master-host-1 (id 2) score is significantly
>>> better than local score, shutting down VM on this host
>>> =====================================================================================
>>> 
>>> I looked a bit at /var/log/messages on the host, and while there is
>>> quite a bit of noise there, I can't tell specifically what might have
>>> caused the high CPU load.
>>> 
>>> I also looked at the logs of previous runs, and there is indeed a
>>> similar pattern, where the CPU load on the host causes the agent to
>>> penalize the score. In those runs, though, the load drops faster, to the
>>> point of being quite low by the time the engine is up, and only then do
>>> we exit global maintenance, allowing the agent to take actions.
>>> 
>>> I now ran it again, but otherwise I think I'll ignore this for now,
>>> unless we see more similar failures. If we do, we might want to
>>> check/monitor/log the CPU load on the hosts, and/or change test_008 to
>>> wait, after the engine is up, until the CPU load on the host goes down
>>> a bit. Before the port to pytest we waited a hard-coded 5 minutes; I
>>> then changed that to only wait until the engine VM is no longer
>>> migrating, and this worked more or less OK. We might need to refine
>>> that, although I'd rather not introduce another arbitrary long delay,
>>> but wait for some condition - something like the sketch below.
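>>>
>>> To make "wait for some condition" concrete, something along these lines
>>> could replace a fixed delay (a rough sketch, not existing OST code; the
>>> statistic name 'cpu.load.avg.5m', the 1.0 threshold and the host lookup
>>> are assumptions):
>>>
>>> import time
>>>
>>> def wait_for_low_host_load(connection, host_name, threshold=1.0,
>>>                            timeout=600):
>>>     # Poll the host's load statistic via the oVirt SDK until it drops
>>>     # below `threshold`, instead of sleeping a hard-coded 5 minutes.
>>>     hosts_service = connection.system_service().hosts_service()
>>>     host = hosts_service.list(search='name=%s' % host_name)[0]
>>>     stats = hosts_service.host_service(host.id).statistics_service()
>>>     deadline = time.monotonic() + timeout
>>>     while time.monotonic() < deadline:
>>>         values = {s.name: s.values[0].datum
>>>                   for s in stats.list() if s.values}
>>>         load = values.get('cpu.load.avg.5m')  # assumed statistic name
>>>         if load is not None and load < threshold:
>>>             return load
>>>         time.sleep(10)
>>>     raise RuntimeError('load on %s did not drop below %s'
>>>                        % (host_name, threshold))
>>>
>>> The same loop could also wait for free memory to recover, since the
>>> penalty here was for both CPU load and low free memory.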
>>> 
>>> I also noticed that this run was started at "03:04 AM" (UTC) by the
>>> timer, whereas all previous timer-based runs started at 01:55 AM,
>>> somewhat earlier - perhaps there are other things running at that time
>>> that cause load.
>> 
>> I checked some more and found another difference: the total CPU use was
>> not very different between the failed and the last successful runs, but
>> the load caused by the engine VM was - in the failed run, the engine VM's
>> load stayed at '0.0' for a longer time than in the successful run. Since
>> the agent tries to ignore the load generated by the engine VM, the final
>> result is significantly different. The load-by-engine-VM is calculated
>> based on the result of a VDSM getStats call on the engine VM. Indeed, in
>> vdsm.log, cpuUser and cpuSys both start at '0.00' in both runs, but in the
>> failed one they stay at '0.00' for longer - 45 seconds between VM.create
>> and the first non-zero cpuUser in the successful run, vs. 93 seconds in
>> the failed one. Does this make sense (from VDSM's POV)? If so, perhaps
>> there is a way to get from VDSM the answer to this question: "Is this 0.00
>> CPU usage a result of still-initial, non-accurate, non-complete
>> monitoring? Or is it really not using the CPU at all?" - and use it in the
>> agent somehow.
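>>
>> To illustrate the kind of adjustment I mean (a rough sketch only, NOT the
>> actual ovirt-hosted-engine-ha scoring code; the scaling factor is made
>> up): if the engine VM keeps reporting 0.0, nothing gets subtracted from
>> the host load, so the penalty stays high for longer:
>>
>> def cpu_load_penalty(host_load, engine_vm_load, max_penalty=1000):
>>     # The agent tries to ignore the load generated by the engine VM
>>     # itself, so subtract it before computing the penalty.
>>     other_load = max(host_load - engine_vm_load, 0.0)
>>     # Hypothetical scaling: a fully busy host costs max_penalty points.
>>     return min(int(other_load * max_penalty), max_penalty)
>>
>> In the failed run the engine VM kept reporting 0.0 for longer, so nothing
>> was subtracted and the score stayed penalized for longer.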
> 
> I think it’s futile to try to understand that for jenkins.ovirt.org runs.
> It’s too unpredictable, since it runs in a shared environment; you can’t
> know what’s going on on the same physical host.
> 
> I understand that I can't get/expect stable performance/load there. That's 
> not my question. I asked about VDSM. I don't know the code there well. I 
> assume that when it starts a VM, it initializes something somewhere with 
> 'cpuUser: 0.00', and once it gets _some_ data, whatever that might be, it 
> starts reporting (based on) that data. My question was whether there is a 
> reliable way to know if '0.00' is real data or initial, not-yet-real data.

I don’t really know for sure, but AFAICT it should be real data from the
start. Maybe not for the very first interval, but afterwards it’s always a
libvirt-reported value.
 
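For illustration, percentages like these are typically derived from the
delta of cumulative libvirt counters between two samples (a sketch, not the
actual vdsm code; the domain name and the 15-second interval are just
examples) - which is also why the very first interval can legitimately read
0.00:

import time

import libvirt  # libvirt-python


def cpu_pct(dom, interval=15):
    # getCPUStats(True) returns one dict of cumulative counters, in ns,
    # for the whole domain: cpu_time, user_time, system_time.
    before = dom.getCPUStats(True)[0]
    t0 = time.monotonic()
    time.sleep(interval)
    after = dom.getCPUStats(True)[0]
    elapsed_ns = (time.monotonic() - t0) * 1e9
    user = 100.0 * (after['user_time'] - before['user_time']) / elapsed_ns
    sys_ = 100.0 * (after['system_time'] - before['system_time']) / elapsed_ns
    return user, sys_


conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('HostedEngine')  # example domain name
print('cpuUser=%.2f cpuSys=%.2f' % cpu_pct(dom))
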
> That’s one of the reasons we are trying to move away from it to
> beaker-based runs that run on dedicated hardware.
> The problem is you have no metrics for the actual host, so the numbers OST
> or vdsm see can be very misleading.
> We do have sar monitoring of the OST VMs, but we do not capture anything
> from the host (the host running the OST code), and I would guess that even
> if we did, for a mock env it would probably be misleading anyway.
> 
> Again: perhaps the actual reported load would be misleading, because it's
> on a VM, which does not own its CPUs. I'd still expect '0.00' to really
> mean zero, or to have a way to know that it actually means "I don't know
> yet".
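>
> Purely as an illustration of the distinction I'm after (hypothetical, not
> an existing VDSM API), the consumer side would only need something like:
>
> def cpu_user_pct(prev_user_ns, cur_user_ns, interval_ns):
>     # Before the first full sampling interval there is nothing to compare
>     # against, so report "I don't know yet" instead of a misleading 0.00.
>     if prev_user_ns is None:
>         return None
>     return 100.0 * (cur_user_ns - prev_user_ns) / interval_ns  # may really be 0.00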
> 
> Best regards,
> -- 
> Didi

_______________________________________________
Devel mailing list -- devel@ovirt.org
To unsubscribe send an email to devel-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/devel@ovirt.org/message/M3DU3LJ42Q3ZFZEKMYTUPSU3UI6JCC6M/
