On Wed, Oct 25, 2023 at 05:24:50PM +0200, Mike Fischer wrote:
> 
> > On 25.10.2023 at 17:07, Theo de Raadt <dera...@openbsd.org> wrote:
> > 
> > Claudio Jeker <cje...@diehard.n-r-g.com> wrote:
> > 
> >> On Wed, Oct 25, 2023 at 11:57:54AM +0200, Mike Fischer wrote:
> >>> I have been observing occasional bouts of high load averages on several
> >>> servers I administer and I am trying to find the cause. (I monitor these
> >>> machines so that I can implement corrective measures in case of any
> >>> malicious or abnormal activity. I think this is benign, but I’d still
> >>> like to find the cause.)
> >>> 
> >>> Once the high load average starts, only a reboot seems to (temporarily)
> >>> return the values to their normal levels.
> >>> 
> >>> The actual CPU usage (as measured by vmstat) stays low even if the load
> >>> average is elevated.
> >>> 
> >>> The servers are VMs running on a VMWare host (ESXi). This was seen with
> >>> OpenBSD 7.3 and 7.4 amd64.
> >>> 
> >>> I can not determine anything inside the VM that causes this. There seems
> >>> to be no correlation to pfstat(8) graphs, log entries, known events, or
> >>> anything else I can determine. restarting all of the rc.d services never
> >>> made any difference.
> >>> 
> >>> Could this be caused by something on the VMWare host machine? (The host
> >>> seems to be operating at limit regarding RAM for example. But the VM is
> >>> only using the normal percentage of its allocated RAM, way below 100%
> >>> and very constant usage, no swap.)
> >>> 
> >>> How can I further debug this, keeping in mind that these are production
> >>> machines and experimentation is limited to benign things that don’t
> >>> cause outages.
> >>> 
> >> 
> >> What is high? A high CPU load for me is in the order of 70+.
> >> Please remember the CPU load average is a horrible leftover from tenex
> >> days. The system just counts how many processes are runnable but it is a
> >> very bad indicator of actual CPU load.
> > 
> > Furthermore, every operating system counts this in a different way.
> > You might think there is only one way to count it. Not at all.
> 
> True. But like I said, this was noticed because of the sudden increase
> on the same (OpenBSD) machine without any obvious reason. I am not
> implying that the value of 0.7 is in any way critical. Just that an
> increase from a long time load average of 0.0x to 0.7x is noteworthy. I
> have no issue when the load increases when a machine is handling
> requests or doing something I know about. But then the load should drop
> back to normal levels once the task is finished. That did not happen in
> the cases I’m trying to figure out.

A process that is started every 5 seconds and exits after 10ms of
computation can cause the load to go up by 1. It just matters whether it
runs during the sampling time or not. This is why the load average is not
accurate: it is an indication, and if the value is below the number of
CPUs you may well be seeing quantization errors.

So yes, maybe there is something going on, but even top -s .1 -I will
have a hard time showing it to you. It may be too small of a blip to spot.

-- 
:wq Claudio
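
To make the sampling effect above concrete, here is a rough Python sketch.
It is not the OpenBSD kernel code. It assumes a simplified model: the
kernel looks at the run queue once every 5 seconds and folds the count
into an exponentially damped 1-minute average. The function name, the
parameters, and the 10-minute duration are made up for illustration.

```python
import math

# Assumed model (not taken from the kernel source): one sample every
# 5 seconds, folded into an exponentially damped 1-minute average.
SAMPLE_INTERVAL = 5.0
WINDOW = 60.0
DECAY = math.exp(-SAMPLE_INTERVAL / WINDOW)


def simulate(period, runtime, phase, duration=600.0):
    """1-minute load average after `duration` seconds for a single job
    that starts at phase, phase + period, ... and runs `runtime` seconds."""
    load = 0.0
    n_samples = int(duration / SAMPLE_INTERVAL)
    for i in range(1, n_samples + 1):
        t = i * SAMPLE_INTERVAL
        # Time elapsed since the most recent start of the job.
        offset = (t - phase) % period
        # The job counts as runnable only if the sample lands inside its run.
        runnable = 1 if offset < runtime else 0
        load = load * DECAY + runnable * (1.0 - DECAY)
    return load


# Same job, same CPU cost, only the start time relative to the sampler differs.
aligned = simulate(period=5.0, runtime=0.010, phase=0.0)
misaligned = simulate(period=5.0, runtime=0.010, phase=2.5)
cpu_usage = 0.010 / 5.0  # fraction of one CPU the job really uses

print(f"aligned with sampler:   load {aligned:.2f}")
print(f"halfway between samples: load {misaligned:.2f}")
print(f"actual CPU usage:        {cpu_usage:.1%}")
```

In this model a job that uses 0.2% of one CPU reports a load close to 1
when it happens to line up with the sampler, and 0 when it does not.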