Here are some of my findings regarding this problem.
The higher load compared to 1.4 comes from the new style of precompiled
checks.
In 1.4, every host had one big precompiled Python file with all checks
included.
In 1.5, each host has only one small file with include statements for all
the checks needed by that host.

Now the problem: every time your Nagios core checks a host, the system has
to load all the small precompiled check files and then run the check.
This appears to be very CPU intensive. The CMC core in the Enterprise
Edition has a Python cache, so it doesn't need to load all the small parts
every time a check is executed.
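
To illustrate the difference, here is a minimal sketch. It is not Checkmk
code: the function names are made up, and the standard-library module
"json" only stands in for the check files. It compares starting a fresh
Python interpreter for every run with re-using modules that a long-lived
process has already loaded.

```python
import importlib
import subprocess
import sys
import time


def time_fresh_interpreters(runs, module="json"):
    """Start a new Python process per run, so the module is loaded every time."""
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run([sys.executable, "-c", "import " + module], check=True)
    return time.perf_counter() - start


def time_cached_process(runs, module="json"):
    """Stay in one process; after the first import the module comes from sys.modules."""
    start = time.perf_counter()
    for _ in range(runs):
        importlib.import_module(module)
    return time.perf_counter() - start


if __name__ == "__main__":
    runs = 20
    print("fresh interpreters: %.3fs" % time_fresh_interpreters(runs))
    print("one cached process: %.3fs" % time_cached_process(runs))
```

The absolute numbers depend on the machine, and the sketch only measures
wall time, so take it as an illustration of the effect rather than a
benchmark of the two cores.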

I have no real solution for this problem. This is only what I saw on some
of my machines running the Raw Edition (CRE) and 1.5.
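
The idea from the original post quoted below (firing 1/6 of the hosts every
10 seconds instead of all of them every 60s) could be sketched roughly like
this. This is only an illustration of the arithmetic, not a Checkmk or
Nagios setting; the function names and the example host names are
hypothetical.

```python
import zlib


def stagger_offset(host, interval=60, slots=6):
    """Return a stable start offset in seconds for a host within the interval."""
    # crc32 is deterministic, so a host always lands in the same slot.
    slot = zlib.crc32(host.encode("utf-8")) % slots
    return slot * (interval // slots)


def schedule(hosts, interval=60, slots=6):
    """Group hosts by the offset at which their check would be started."""
    buckets = {s * (interval // slots): [] for s in range(slots)}
    for host in hosts:
        buckets[stagger_offset(host, interval, slots)].append(host)
    return buckets


if __name__ == "__main__":
    hosts = ["host%03d" % i for i in range(114)]  # made-up host names
    for offset, group in sorted(schedule(hosts).items()):
        print("t+%02ds: %d hosts" % (offset, len(group)))
```

Every host is still checked once per 60s, but the processes start in six
smaller groups instead of all at once.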

Best regards
Andreas

On Thu, 15 Aug 2019 at 15:41, Joshua M. Boniface <
jos...@boniface.me> wrote:

> Fair enough!
>
> I've left both settings to stew overnight, but it doesn't look to have
> made any substantial change in the load. The precompiled checks just fire
> every 90s instead of every 60s, but load is still through the roof whenever
> they do.
>
> I'm hoping someone else on the list has an idea of how to break this down
> and see what might be the culprit. I have a number of custom checks that
> this may be related to, but I don't have a way of knowing since it's just
> this one binary that gets run.
>
> Thanks,
> Joshua
> On 2019-08-14 3:52 p.m., Brian Binder wrote:
>
> I was lazy after doing it, really.
> The 2 settings worked and I haven’t revisited it since.
> I’m sure I should ;)
>
> On Aug 14, 2019, at 10:48 AM, Joshua M. Boniface <jos...@boniface.me>
> wrote:
>
> Thanks Brian, I'll increase it from 5s to 15s and see how it looks. It
> seems curious to me, though, that the timeout would have an effect on the
> load of spawning all these processes. Do you know why that might be
> the case? Or was this just a case of finding that it worked?
> On 2019-08-14 11:46 a.m., Brian Binder wrote:
>
> The 2 things I listed worked for me. Start with the TCP connect timeout if
> you don’t want to do the increase in check interval. Then see how your CPU
> looks for the day with only making 1 modification.
> On Aug 14, 2019, 10:42 AM -0500, Joshua M. Boniface <jos...@boniface.me>
> wrote:
>
> Hello list!
>
> I've been using CMK 1.4 for a few years now, and 1.5 recently, and I've
> noticed a trend of increasing CPU usage from the
> precompiled checks.
>
> I reviewed a previous thread on this topic (
> https://lists.mathias-kettner.de/pipermail/checkmk-en/2019-May/027837.html
> ),
> but it doesn't look like there was ever a solution there, and I'd like to
> dig more into it. In my case, I've noticed
> these spikes in 1.4 as well, so I don't think it's really a new version
> problem.
>
> To recap what's happening, I've got 114 hosts and 4369 services total
> (almost all of them passive via the agent). Every
> 60s, as expected, the checks run against all hosts and services. This
> results in a massive spike in CPU utilization,
> with the ~100 or so instances of `/omd/sites/monitor/bin/python
> /omd/sites/monitor/var/check_mk/precompiled/<host>`
> firing at once. Each one is using ~30% CPU as reported by htop. The spike
> in CPU usage is very noticeable on my
> relatively slow/old hypervisors, and I see a dramatic increase in
> hypervisor power usage as well due to the load spike,
> with the load of the VM in question sitting at an almost constant
> 15-minute value of 15+.
>
> I understand what's going on here (processing the data from all those
> checks needs CPU after all), but I'm wondering if
> there's any way to break this down by check, to see what specific
> check(s) might be responsible for so much CPU usage,
> and if there are ways to smooth this out a bit so that there isn't a sudden
> burst of 100+ processes that therefore take
> longer and load the CPU higher. Maybe firing 1/6 of them every 10 seconds
> instead of all every 60s or something of that
> nature.
>
> I've tried the usual suggestions here (set a less frequent inventory
> interval [1 day], disable on-demand compiling) but
> I *haven't* reduced the check interval: I want 60s check intervals (and
> would prefer even more frequent if this seemed
> feasible).
> For power saving I've also considered putting my CMK instance on a
> dedicated Raspberry Pi 3, but given this huge
> CPU usage I'm not confident it would even be able to run smoothly,
> so I'd like to get to the bottom of it.
>
> Anyone have any advice on the troubleshooting/balancing aspects of this
> they can share?
>
> Thanks for reading,
> Joshua
>
>
>
>
> _______________________________________________
> checkmk-en mailing list
> checkmk-en@lists.mathias-kettner.de
> Manage your subscription or unsubscribe
> https://lists.mathias-kettner.de/cgi-bin/mailman/listinfo/checkmk-en