On Thu, 10 Oct 2013, Denis Simonet wrote:

> We now updated the server to Wheezy and installed the icinga-backport,
> which is version 1.9.3.
>
> Unfortunately the problem wasn't solved with this update. When reloading
> the Daemon some passive services get red but recover after a few minutes
> (some of them reach hard state before this happens). A restart is much
> worse, almost all of the passive services become red.

Are you performing "freshness checks" on your passive services, and if so,
what are your thresholds, and does a "stale" indication cause an "unknown"
or a "critical" state?

> Some facts which might be of interest:
> - We disabled performance data processing which improved but didn't
>   solve the situation
> - The external commands check interval is set to -1,
>   external_command_buffer_slots to 32768

The number of buffer slots seems to be really high. If you run
${ICINGA_BIN}/icingastats what does the "Used/High/Total Command Buffers"
output indicate? Also, since the interval setting of -1 is "magic", meaning
"check as often as possible", you may be spinning the CPU needlessly looking
at the queue. Try setting it to 5 seconds to see what happens. (Note that
this interacts with the buffer_slots setting so watch the two as a pair.)

> - ido2db runs and is configured using the new module objects, it isn't
>   enabled in icinga.cfg (as written in the comments)

Good. Loading it twice could well cause problems.

> - /var/lib/icinga/spool/checkresults is tmpfs
> - On a restart it takes up to half an hour until all services become
>   green again
> - There are 416 hosts and 2363 Services, more than half of them passive

That's not a particularly large environment. I used to run a distributed
system that had a little over 1,000 hosts and upwards of 10,000 "service"
data-points and was doing so on iron fairly similar to yours.

> - When the "red phase" happens, we see in the logs that the external
>   commands are read but there are no entries that these commands are
>   translated to passive checks (which indicates that Icinga somehow can't
>   handle the load of external commands)
> - The load (8 cores machine, 12GB ram) increases to 1.8 when it happens

That sounds suspiciously like single-threading behaviour. What processes
are running flat-out during the anomalous period?
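
One rough way to check for that single-threading pattern is to look at
per-core utilisation instead of the load average. The sketch below is only an
illustration and assumes a Linux box exposing /proc/stat; the function names,
the 5-second sample window and the 90% threshold are arbitrary choices of
mine, not anything Icinga provides. It compares two snapshots and reports how
busy each core was in between.

```python
import time


def parse_cpu_times(stat_text):
    """Return {core_name: (busy_jiffies, total_jiffies)} from /proc/stat text."""
    cores = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Keep per-core lines ("cpu0 ...", "cpu1 ..."); skip the aggregate "cpu" line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        values = [int(v) for v in fields[1:]]
        # Columns: user nice system idle iowait irq softirq steal [guest guest_nice]
        # Guest time is already included in user/nice, so only sum the first eight.
        total = sum(values[:8])
        idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
        cores[fields[0]] = (total - idle, total)
    return cores


def busy_percent(before, after):
    """Percentage of time each core spent busy between two snapshots."""
    result = {}
    for name, (busy_now, total_now) in after.items():
        busy_then, total_then = before.get(name, (0, 0))
        elapsed = total_now - total_then
        result[name] = 100.0 * (busy_now - busy_then) / elapsed if elapsed > 0 else 0.0
    return result


def read_proc_stat():
    with open("/proc/stat") as handle:
        return handle.read()


if __name__ == "__main__":
    first = parse_cpu_times(read_proc_stat())
    time.sleep(5)  # assumed sample window; run this during the "red phase"
    second = parse_cpu_times(read_proc_stat())
    for core, percent in sorted(busy_percent(first, second).items()):
        flag = "  <-- saturated?" if percent > 90.0 else ""
        print(f"{core}: {percent:5.1f}%{flag}")
```

If one core sits near 100% while the rest idle, running "top -H" at the same
moment should show which process or thread is pinned to it.
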
(A load average of 1.8 on an 8-core system is close to nil compared to what
the box can handle, and 12 gigs of mainstore is a *lot* unless you're using
VMs aggressively.)

> We also read the tuning page
> (http://docs.icinga.org/latest/en/tuning.html) but didn't find a way to
> solve our issue with it.
>
> Do you have an idea what could be the bottleneck?

For one, can you reproduce the thing in a lab setting so you don't have to
fiddle with your production instance? That'd take a lot of the pressure off
during the troubleshooting phase as you'll be restarting the instance fairly
frequently to deliberately perturb the behaviour and you don't want that in
production.

The methodology I'd use here is to find out what's soaking single cores
during the time when the system is misbehaving and try to adjust things that
could influence overall system behaviour. In particular, it'd be interesting
to see if the problem is IDO or not, and that's easily determined by not
loading the idomod module at startup time and seeing how the system behaves.

I'm guessing that you probably followed -- or at least excluded from
consideration -- the advice in the tuning document, but I would absolutely
encourage the use of PNP4Nagios as a troubleshooting tool in this. It'll
graph the outputs from "icingastats" over time and make them easier to
interpret than comparing printouts. You'll find that tool especially useful
in tuning the number of buffer slots and command-file check frequency.

The new version of ido2db, and its FIFO, cures a lot of grief, but it may
back up if the FIFO gets huge by virtue of a slow RDBMS engine. Is the
database instance "yours" or under somebody else's control? Is it shared
with other applications? (Note that this will be immediately apparent if you
shut down idomod altogether and the system (using "Classic" web) begins to
behave normally.)

Cheers!

+------------------------------------------------+---------------------+
| Carl Richard Friend (UNIX Sysadmin)            | West Boylston       |
| Minicomputer Collector / Enthusiast            | Massachusetts, USA  |
| mailto:crfri...@rcn.com                        +---------------------+
| http://users.rcn.com/crfriend/museum           | ICBM: 42:22N 71:47W |
+------------------------------------------------+---------------------+