A couple of days ago, I ran into a problem I've never seen before. We run a
single large Nagios instance with mostly very heterogeneous checks and host
types. One particular group of Windows hosts, however, is quite uniform, and
like most of our other hosts they rely on templates. I needed to add 10 more
hosts of this type. Typically all I have to do is define the hosts; the
service checks follow automatically, because the host template puts each host
in a hostgroup that has all the relevant service checks attached.
I added maybe 5 of these new hosts, ran the pre-flight check and restarted.
After the restart I noticed that our failing service checks (across all
services) went from around 260 to over 4K. All of the newly failing checks
were on hosts of this same type (the Windows application servers mentioned
above, which the new hosts also belong to), and they all reported the same
failure condition:
(Return code of 127 is out of bounds - plugin may be missing)
Now ordinarily this would indicate a client-side issue, but there isn't one. I
can validate that by running check_nrpe manually against any of these hosts.
I could imagine a typo causing this, even on existing hosts that had not been
touched, but I double-checked and did not find one (I was just adding host
definitions to this group - nothing else).
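
For reference, here is a minimal sketch of where an exit status of 127 can
come from. It assumes the check command is launched through a POSIX shell;
the plugin name below is made up purely for illustration. A shell exits with
127 when it cannot find the command it was asked to run, and Nagios only
accepts plugin return codes 0 through 3, so anything else is reported as "out
of bounds".

```python
import subprocess

# Ask a POSIX shell to run a command that does not exist.
# "no_such_plugin_xyz" is a hypothetical name, not a real plugin.
result = subprocess.run(
    ["/bin/sh", "-c", "no_such_plugin_xyz --version"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)

# The shell's own exit status when the command cannot be found.
exit_code = result.returncode
print(exit_code)
```

This only shows that a 127 can be generated on whichever side invokes the
shell; it does not say which side generated it here.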
I cloned this environment and experimented with it in a non-production instance
that was identical to the production Nagios instance except for a slightly
newer version of Merlin in the backend (1.1.14 for the non-prod instance,
1.1.13-something for the production one); both use the same Nagios 3.3.1 +
downtime locking patches. I was able to reproduce the situation there. After a
couple of days of trial and error I've still not been able to completely
isolate the issue, but I've determined that
- it has nothing to do with the mk-livestatus module (turning it off and
back on made no difference), although the module has been very helpful in
figuring out which of the 13K+ services and 1200+ hosts are impacted
- it doesn't seem to be about adding random hosts and services. I can
add others and this doesn't happen
- the host definition uses a template that puts the host in a hostgroup.
Those hostgroups are then used in service definitions (12-15 services,
depending on which group). I had thought that the problem might be the
hostgroup_name line of a service definition expanding to too many hosts. I
split the service definitions in two, one for each production hostgroup
rather than one combined definition, and that made no difference.
- the service templates that the service definitions use for these hosts
all add them to a common servicegroup. My current thinking is that the problem
has something to do with this. I set up a test scenario in which I create a new
host, exclude it from the hostgroup definitions, and manually create service
definitions for it (I know this "one more host" is right on the cusp of the
problem). When I add services so that the 4,331st service joins the
servicegroup, the problem starts. If I remove that membership from that host's
service definition, all the other hosts' services recover. However, by that
reasoning, commenting out the servicegroup assignment in the service template
these hosts use should stop the problem - and it doesn't.
- the only affected services are those on hosts in the hostgroups I'm
changing. Other unrelated hosts and services are unaffected. There are 3
hostgroups: Production Appname Hosts 1, Production Appname Hosts 2, and All
Appname Hosts, which is the combination of the other two. All Appname Hosts
has around 324 hosts.
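
To make the servicegroup numbers above easier to check, here is a minimal
sketch that counts the members of each servicegroup and measures the length of
its member list. It assumes the definitions look like the "define servicegroup"
blocks in the object cache file, with a members line of alternating
host,service pairs; the sample data and group names are invented for
illustration.

```python
import re

# Invented sample in the assumed object-cache layout.
SAMPLE = """\
define servicegroup {
\tservicegroup_name\tappname-services
\talias\tAppname services
\tmembers\thost1,CPU Load,host1,Disk C,host2,CPU Load
\t}
define servicegroup {
\tservicegroup_name\tother
\tmembers\thostA,PING
\t}
"""

# Matches the body of each servicegroup block (assumes no "}" inside values).
BLOCK_RE = re.compile(r"define\s+servicegroup\s*\{(.*?)\}", re.DOTALL)


def servicegroup_sizes(text):
    """Return {name: (member_count, member_string_length)}."""
    sizes = {}
    for body in BLOCK_RE.findall(text):
        fields = {}
        for line in body.splitlines():
            # Split each line into directive name and value on first whitespace.
            parts = line.strip().split(None, 1)
            if len(parts) == 2:
                fields[parts[0]] = parts[1].strip()
        name = fields.get("servicegroup_name")
        if name is None:
            continue
        members = fields.get("members", "")
        items = [m for m in members.split(",") if m]
        # Members come in host,service pairs, so halve the item count.
        sizes[name] = (len(items) // 2, len(members))
    return sizes


sizes = servicegroup_sizes(SAMPLE)
for name, (count, length) in sorted(sizes.items()):
    print(name, count, length)
```

To run it against a real instance, read the file named by object_cache_file in
nagios.cfg instead of SAMPLE. As a sanity check on the figures above: 324 hosts
with 12-15 services each gives between 3,888 and 4,860 services, so a group
crossing 4,331 members is in the expected range.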
I'm not really sure what to try at this point. It does seem like I've hit some
kind of internal limitation in Nagios, but I don't know how to find out
anything more about it. If I could isolate this completely to, say, adding
members to a single servicegroup, I could avoid doing that and continue adding
hosts as we need them, but so far I have not been able to find such a
workaround. If there is a limitation like this, it would of course be nice for
the pre-flight check to tell me that I can't have more than X members in a
servicegroup or something.
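
In the absence of such a check, the member counts can be read from the running
instance. The following is a minimal sketch, assuming the Livestatus
servicegroups table exposes a num_services column in the installed version and
that the module's UNIX socket is reachable; the socket path and the limit are
placeholders to be supplied by the caller, not known values.

```python
import socket


def build_query():
    # Livestatus queries are plain text terminated by a blank line.
    return "GET servicegroups\nColumns: name num_services\n\n"


def parse_response(raw):
    """Parse 'name;count' lines into a {name: count} dict."""
    rows = {}
    for line in raw.splitlines():
        if not line:
            continue
        # Split on the last ';' in case a group name contains one.
        name, count = line.rsplit(";", 1)
        rows[name] = int(count)
    return rows


def query_livestatus(socket_path):
    """Send the query to the Livestatus socket at socket_path (caller-supplied)."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(socket_path)
        sock.sendall(build_query().encode("utf-8"))
        # Signal end of request, then read until the server closes.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    finally:
        sock.close()
    return parse_response(b"".join(chunks).decode("utf-8"))


def over_limit(counts, limit):
    """Names of servicegroups with more than `limit` members (limit is a guess)."""
    return sorted(name for name, n in counts.items() if n > limit)
```

Calling over_limit(query_livestatus(path), x) before and after a configuration
change would at least show which groups cross a suspected threshold x.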
Other info:
Nagios version: Nagios 3.3.1 with locking patches
Merlin backend: 1.1.13+ (production), 1.1.14 (test)
MK-Livestatus module 1.1.12p6 installed (uninstalling it makes no difference)
OS: SLES 11.1 Linux, 64-bit
Memory: 12GB
CPU: 2x 2.4 GHz quad-core Xeon
What can I do?
Thanks
Mark
_______________________________________________
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting
any issue.
::: Messages without supporting info will risk being sent to /dev/null