On Thu, 10 Oct 2013, Denis Simonet wrote:

> We now updated the server to Wheezy and installed the icinga-backport,
> which is version 1.9.3.
>
> Unfortunately the problem wasn't solved with this update. When reloading
> the Daemon some passive services get red but recover after a few minutes
> (some of them reach hard state before this happens). A restart is much
> worse, almost all of the passive services become red.

    Are you performing "freshness checks" on your passive services, and
if so, what are your thresholds, and does a "stale" indication cause
an "unknown" or a "critical" state?
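
    To make the question concrete, here is a minimal Python sketch of
the staleness test I have in mind.  It is a simplification and not
Icinga's actual implementation; the names (PassiveService, is_stale,
stale_services) and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PassiveService:
    name: str
    last_check: float           # epoch seconds of the last passive result
    freshness_threshold: float  # seconds a result may age before it is stale


def is_stale(svc: PassiveService, now: float) -> bool:
    """True when the last passive result is older than the threshold."""
    return (now - svc.last_check) > svc.freshness_threshold


def stale_services(services, now: float):
    """Names of all services whose last result has gone stale at time `now`."""
    return [s.name for s in services if is_stale(s, now)]


# Hypothetical sample: both results arrived at t=1000.
services = [
    PassiveService("backup-job", last_check=1000.0, freshness_threshold=600.0),
    PassiveService("log-shipper", last_check=1000.0, freshness_threshold=3600.0),
]
late = stale_services(services, now=1700.0)  # only "backup-job" exceeds its threshold
```

    If results stop being processed for longer than the threshold
(as could happen while the daemon is busy after a restart), every
service with a short threshold would flip to whatever state the
freshness check command returns.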

> Some facts which might be of interest:
> - We disabled performance data processing which improved but didn't
> solve the situation
> - The external commands check interval is set to -1,
> external_command_buffer_slots to 32768

    The number of buffer slots seems really high.  If you run
${ICINGA_BIN}/icingastats, what does the "Used/High/Total Command
Buffers" output indicate?  Also, since the interval setting of -1
is "magic", meaning "check as often as possible", you may be spinning
the CPU needlessly looking at the queue.  Try setting it to 5 seconds
("5s"; a bare number is read as time units, not seconds) to see what
happens.  (Note that this interacts with the buffer_slots setting, so
watch the two as a pair.)
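
    If you want to track those numbers rather than eyeball them, a
small Python sketch along these lines could pull them out of the
icingastats text.  The exact spacing of the output line is an
assumption on my part, so the pattern is deliberately loose; the
function name and the sample line are made up for illustration.

```python
import re

# Loose match for a line such as:
#   Used/High/Total Command Buffers:     12 / 4096 / 32768
_BUF_RE = re.compile(
    r"Used/High/Total Command Buffers:\s*(\d+)\s*/\s*(\d+)\s*/\s*(\d+)"
)


def parse_command_buffers(stats_text: str):
    """Return used/high/total buffer counts, or None if the line is absent."""
    match = _BUF_RE.search(stats_text)
    if match is None:
        return None
    used, high, total = (int(group) for group in match.groups())
    return {
        "used": used,
        "high": high,
        "total": total,
        # Fraction of the configured slots ever needed at once.
        "high_water_ratio": high / total if total else 0.0,
    }


# Hypothetical sample line, not taken from a real system.
sample = "Used/High/Total Command Buffers:     12 / 4096 / 32768"
buffers = parse_command_buffers(sample)
```

    A high-water mark far below the total suggests the slot count
can come down; one that touches the total suggests commands are being
held up.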

> - ido2db runs and is configured using the new module objects, it isn't
> enabled in icinga.cfg (as written in the comments)

    Good.  Loading it twice could well cause problems.

> - /var/lib/icinga/spool/checkresults is tmpfs
> - On a restart it takes up to half an hour until all services become
> green again
> - There are 416 hosts and 2363 Services, more than half of them passive

    That's not a particularly large environment.  I used to run a
distributed system that had a little over 1,000 hosts and upwards of
10,000 "service" data-points and was doing so on iron fairly similar
to yours.

> - When the "red phase" happens, we see in the logs that the external
> commands are read but there are no entries that these commands are
> translated to passive checks (which indicates that Icinga somehow can't
> handle the load of external commands)
> - The load (8 cores machine, 12GB ram) increases to 1.8 when it happens

    That sounds suspiciously like single-threading behaviour.  What
processes are running flat-out during the anomalous period?  (A load
average of 1.8 on an 8-core system is close to nil compared to what
the box can handle, and 12 gigs of mainstore is a *lot* unless
you're using VMs aggressively.)
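
    The arithmetic behind that remark, as a small Python sketch (the
function name is made up for illustration):

```python
def load_per_core(load_average: float, cores: int) -> float:
    """Average number of runnable tasks per core."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores


# Figures from the report above: load 1.8 on an 8-core machine.
ratio = load_per_core(1.8, 8)
print(f"{ratio:.3f}")  # prints 0.225
```

    Under a quarter of the machine is busy on average, yet a load
near 2 is about what one or two processes pinned to single cores
would produce.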

> We also read the tuning page
> (http://docs.icinga.org/latest/en/tuning.html) but didn't find a way to
> solve our issue with it.
>
> Do you have an idea what could be the bottleneck?

    For one, can you reproduce the thing in a lab setting so you don't
have to fiddle with your production instance?  That'd take a lot of
the pressure off during the troubleshooting phase as you'll be
restarting the instance fairly frequently to deliberately perturb
the behaviour and you don't want that in production.

    The methodology I'd use here is to find out what's soaking single
cores during the time when the system is misbehaving and try to
adjust things that could influence overall system behaviour.  In
particular, it'd be interesting to see if the problem is IDO or not,
and that's easily determined by not loading the idomod module at
startup time and seeing how the system behaves.

    I'm guessing that you probably followed -- or at least excluded
from consideration -- the advice in the tuning document, but I would
absolutely encourage the use of PNP4Nagios as a troubleshooting tool
in this.  It'll graph the outputs from "icingastats" over time and
make them easier to interpret than comparing printouts.  You'll
find that tool especially useful in tuning the number of buffer slots
and command-file check frequency.

    The new version of ido2db, and its FIFO, cures a lot of grief,
but it may back up if the FIFO gets huge by virtue of a slow RDBMS
engine.  Is the database instance "yours" or under somebody else's
control?  Is it shared with other applications?  (Note that a database
bottleneck will be immediately apparent if you shut down idomod
altogether and the system (using the "Classic" web interface) begins
to behave normally.)

    Cheers!

+------------------------------------------------+---------------------+
| Carl Richard Friend (UNIX Sysadmin)            | West Boylston       |
| Minicomputer Collector / Enthusiast            | Massachusetts, USA  |
| mailto:crfri...@rcn.com                        +---------------------+
| http://users.rcn.com/crfriend/museum           | ICBM: 42:22N 71:47W |
+------------------------------------------------+---------------------+
