[Nagios-users] Have we reached some kind of Nagios limit?

Frost, Mark {BIS} Sat, 18 Feb 2012 10:26:06 -0800

A couple of days ago, I ran into a problem I've never seen before.  We run a 
single large instance with mostly very heterogeneous checks and host types.  
One particular group of Windows hosts, however, are all quite similar and, 
like most of our other checks, they rely on templates.  I needed to add 10 
more hosts of this particular type.  Typically all I have to do is define the 
hosts; the service checks follow automatically because the host templates put 
the hosts in a hostgroup that carries all the relevant checks.


I added maybe 5 of these new hosts, ran the pre-flight check and restarted.  
After the restart I noticed that our failing service checks (across all 
services) went from around 260 to over 4K.  All of the newly failing checks 
were on hosts of this same type (the Windows application servers I mentioned 
above, which the new hosts also belong to) and they were all reporting the 
same failure condition:

(Return code of 127 is out of bounds - plugin may be missing)

Now ordinarily this would indicate a client-side issue, but there isn't one.  I 
can validate that by running check_nrpe manually against any of these hosts.  
I could imagine a typo causing this, particularly against existing hosts that 
had not been touched, but I double-checked and did not find one (I was just 
adding host definitions to this group - nothing else).
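
For reference, 127 is the exit status a POSIX shell returns when it cannot 
find the command it was asked to run, while a plugin is only expected to 
return 0-3.  The following is a minimal sketch, assuming the check command 
line is executed through a shell, of how that status can be reproduced and 
classified on the Nagios server itself.  The function names are illustrative 
and not part of Nagios.

```python
import subprocess

# Standard Nagios plugin return codes.
PLUGIN_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_via_shell(command: str) -> int:
    """Run a command line through /bin/sh and return its exit status."""
    result = subprocess.run(
        command,
        shell=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def classify(return_code: int) -> str:
    """Map an exit status to a plugin state, or flag it as out of bounds."""
    if return_code in PLUGIN_STATES:
        return PLUGIN_STATES[return_code]
    if return_code == 127:
        return "out of bounds (shell: command not found)"
    return "out of bounds"
```

A call such as classify(run_via_shell("/usr/local/nagios/libexec/check_nrpe 
-H somehost -c some_command")) would exercise the same path by hand; the 
plugin path, host name and command name there are hypothetical.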

I cloned this environment and went to play with it in a non-production instance 
that was identical to the production Nagios instance except for a slightly 
newer version of Merlin in the backend (1.1.14 for the non-prod instance, 
1.1.13-something for the production one); both used the same Nagios 3.3.1 + 
downtime locking patches.   I was able to reproduce the situation, and after a 
couple of days of trial and error I've still not been able to completely 
isolate the issue, but I've determined that

-       it's not got anything to do with the mk-livestatus module (turned it 
off, turned it back on), but it's been very helpful in figuring out which of 
the 13K+ services and 1200+ hosts are impacted
-       it doesn't seem to be about adding random hosts and services.   I can 
add others and this doesn't happen
-       the host definition uses a template that puts the host in a hostgroup.  
Those hostgroups are then used in service definitions (12-15 services, 
depending on which group).   I had thought that perhaps if the hostgroup_name 
line of the service definition expanded to too many hosts that could be the 
problem.  I broke the service definitions down into 2 definitions, one for each 
production hostgroup rather than combining them, and that didn't matter.
-       the service templates that the service definitions use for these hosts 
all add them to a common servicegroup.  My current line of thinking leads me to 
believe it's got something to do with this.   With a particular test scenario I 
created where I create a new host, but exclude it from the hostgroup 
definitions and instead manually create service definitions for this host (I 
know this "one more host" is right on the cusp of this problem), I find that 
when I add it so the 4,331st service gets added to the servicegroup, the 
problem starts.  If I remove that from that host's service definition all the 
other hosts' services recover.   However, based on this thinking, if I just 
comment out the servicegroup add from the service template these hosts use, the 
problem should stop - it doesn't.
-       the only affected services are those on hosts in the hostgroups I'm 
changing.   Other unrelated hosts and services are unaffected.   There are 3 
hostgroups: Production Appname Hosts 1, Production Appname Hosts 2, and All 
Appname Hosts, which is obviously a combination of the two.   All Appname Hosts 
is around 324 hosts.
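
The way mk-livestatus can be used to list the affected services might look 
roughly like this.  It is a sketch only: the socket path is an assumption 
and depends on how the broker module is configured, and the helper names are 
made up for illustration.  The query itself uses the standard Livestatus 
"GET services" table with a regex filter on plugin_output.

```python
import json
import socket
from collections import defaultdict

# Assumption: adjust to the socket path given to the livestatus broker module.
LIVESTATUS_SOCKET = "/usr/local/nagios/var/rw/live"


def build_query(output_pattern: str) -> str:
    """Build a Livestatus query for services whose plugin output matches a regex."""
    lines = [
        "GET services",
        "Columns: host_name description plugin_output",
        f"Filter: plugin_output ~ {output_pattern}",
        "OutputFormat: json",
    ]
    # A blank line terminates a Livestatus request.
    return "\n".join(lines) + "\n\n"


def query_livestatus(query: str, path: str = LIVESTATUS_SOCKET) -> list:
    """Send one query to the Livestatus UNIX socket and decode the JSON reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal the end of the request
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))


def count_by_host(rows: list) -> dict:
    """Count matching services per host from [host_name, description, output] rows."""
    counts = defaultdict(int)
    for host_name, _description, _output in rows:
        counts[host_name] += 1
    return dict(counts)
```

Running count_by_host(query_livestatus(build_query("Return code of 127"))) 
on the Nagios server would then give a per-host tally of the services 
reporting the error.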

I'm not really sure what to try at this point.  It does seem like I've hit some 
kind of internal limitation in Nagios, but I don't know how to determine 
anything else about it beyond this.  If I were able to completely isolate this 
to, say, not adding anything to a single servicegroup, I could avoid that and 
continue adding hosts as we need them, but I have so far not been able to find 
such a workaround.   If there is a limitation like this, it would, of course, 
be nice for the pre-flight check to tell me that I can't have more than X 
members of a servicegroup or something.
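
Until something like that exists, the size of each servicegroup can be 
measured from outside Nagios.  Below is a rough sketch that assumes the 
objects.cache layout, where each servicegroup block carries a single 
"members" line of comma-separated host,service pairs, and that no host or 
service name contains a comma or a closing brace.  It reports both the 
member count and the length of the members string, since it is not clear 
which of the two (if either) matters.  The function names are illustrative.

```python
import re

# Matches one "define servicegroup { ... }" block in an objects.cache-style file.
BLOCK_RE = re.compile(r"define servicegroup\s*\{(.*?)\}", re.DOTALL)


def servicegroup_sizes(text: str) -> dict:
    """Return {servicegroup_name: (member_count, members_string_length)}."""
    sizes = {}
    for block in BLOCK_RE.findall(text):
        fields = {}
        for line in block.splitlines():
            parts = line.strip().split(None, 1)
            if len(parts) == 2:
                fields[parts[0]] = parts[1].strip()
        name = fields.get("servicegroup_name")
        if name is None:
            continue
        members = fields.get("members", "")
        # Members are stored as host,service,host,service,... (two items each).
        items = [item for item in members.split(",") if item]
        sizes[name] = (len(items) // 2, len(members))
    return sizes


def oversized(sizes: dict, max_members: int) -> list:
    """List servicegroups whose member count exceeds a chosen threshold."""
    return sorted(
        name for name, (count, _length) in sizes.items() if count > max_members
    )
```

Feeding it the contents of objects.cache with a threshold of 4330 would flag 
any servicegroup past the point where the problem started here.  That number 
is only what I observed, not a documented limit.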

Other info:

Nagios version: Nagios 3.3.1 with locking patches
Merlin backend: 1.1.13+ (production), 1.1.14 (test)
MK-Livestatus module 1.1.12p6 installed (removing it makes no difference)
OS: SLES 11.1 Linux, 64-bit
Memory: 12GB
CPU: 2x 2.4GHz quad-core Xeon

What can I do?

Thanks

Mark

