Dear Carl

On 10.10.2013 18:02, Carl R. Friend wrote:
> Are you performing "freshness checks" on your passive services, and
> if so, what are your thresholds, and does a "stale" indication cause
> an "unknown" or a "critical" state?

Oh, I forgot about that. We had the threshold set to 0, and a stale result causes a critical state - that's what I meant with "red". We have now set the threshold to the check interval plus 5 minutes, so the "red phase" (passive services going critical on reload or restart) is less annoying. But this doesn't solve the underlying problem: on reload or restart Icinga considers the passive services stale because it can't process the passive check results fast enough.
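For reference, the relevant part of our passive service definitions now looks roughly like this (host name, template and the 5-minute check interval are just placeholders for our setup, and we assume the standard check_dummy plugin from the plugins package to force the CRITICAL state on staleness):

  define service{
          use                     generic-service        ; placeholder template
          host_name               satellite-host         ; placeholder
          service_description     Passive application check
          active_checks_enabled   0
          passive_checks_enabled  1
          check_freshness         1
          check_interval          5                      ; 5 x interval_length (60s) = 300s
          freshness_threshold     600                    ; check interval (300s) plus 5 minutes
          check_command           check_dummy!2!"No passive result received"
          }

  define command{
          command_name    check_dummy
          command_line    $USER1$/check_dummy $ARG1$ "$ARG2$"
          }
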
> The number of buffer slots seems to be really high. If you run
> ${ICINGA_BIN}/icingastats what does the "Used/High/Total Command
> Buffers" output indicate? Too, since the interval setting of -1
> is "magic" meaning "check as often as possible" you may be spinning
> the CPU needlessly looking at the queue. Try setting it to 5 seconds
> to see what happens. (Note that this interacts with the buffer_slots
> setting so watch the two as a pair.)

Ok, I will do so (the settings I plan to try are sketched further down). The output indicates: 0 / 118 / 32768 - so the number of buffer slots really is far higher than what is actually used.

> That's not a particularly large environment. I used to run a
> distributed system that had a little over 1,000 hosts and upwards of
> 10,000 "service" data-points and was doing so on iron fairly similar
> to yours.

Good to know!

> That sounds suspiciously like single-threading behaviour. What
> processes are running flat-out during the anomalous period? (A load
> average of 1.8 on an 8-core system is close to nil compared to what
> the box can handle, and 12 gigs of mainstore is a *lot* unless
> you're using VMs aggressively.)

Yes, on reload or restart I can see two, or perhaps three, icinga processes, not more. And yes, we also think that the system really should handle the load smoothly - it is a dedicated machine without VMs. I will have a closer look at the processes.

> For one, can you reproduce the thing in a lab setting so you don't
> have to fiddle with your production instance? That'd take a lot of
> the pressure off during the troubleshooting phase as you'll be
> restarting the instance fairly frequently to deliberately perturb
> the behaviour and you don't want that in production.

We also thought about that. It is not easy to do, as several "satellites" are involved, but we are looking into ways of automating the setup so that we can run stress tests.

> The methodology I'd use here is to find out what's soaking single
> cores during the time when the system is misbehaving and try to
> adjust things that could influence overall system behaviour. In
> particular, it'd be interesting to see if the problem is IDO or not,
> and that's easily determined by not loading the idomod module at
> startup time and seeing how the system behaves.

This was my idea for the next step, too (see the sketch further down for the change I have in mind).

> I'm guessing that you probably followed -- or at least excluded
> from consideration -- the advice in the tuning document, but I would
> absolutely encourage the use of PNP4Nagios as a troubleshooting tool
> in this. It'll graph the outputs from "icingastats" over time and
> make them easier to interpret than comparing printouts. You'll
> find that tool especially useful in tuning the number of buffer slots
> and command-file check frequency.

Good idea. The thing is that we disabled performance data processing because we thought that it could itself influence the ... performance :). Is the process-performance-data command executed serially, together with the processing of external commands, or does it run in parallel? We actually need to execute two things for each result: the PNP4Nagios script and a PHP script which inserts information into a database. We thought that we probably have to implement some kind of FIFO so that the external command returns as fast as possible, since it is executed for every check result. Do you think we should do that? (What we are considering is sketched below.)
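Coming back to the performance-data question: what we have in mind is roughly the "bulk" mode described in the PNP4Nagios documentation, where Icinga only appends each result to a perfdata file and a separate processing command periodically moves that file into a spool directory, so npcd (and later our PHP script) can work through it asynchronously. A sketch - all paths are placeholders for our installation, and the template line is the one from the PNP4Nagios docs:

  # icinga.cfg (excerpt)
  process_performance_data=1
  service_perfdata_file=/var/spool/icinga/service-perfdata
  service_perfdata_file_template=DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\tSERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\tSERVICESTATETYPE::$SERVICESTATETYPE$
  service_perfdata_file_mode=a
  service_perfdata_file_processing_interval=15
  service_perfdata_file_processing_command=process-service-perfdata-file

  # commands.cfg (excerpt) - the command only renames the file, so it returns
  # immediately instead of running our scripts once per check result
  define command{
          command_name    process-service-perfdata-file
          command_line    /bin/mv /var/spool/icinga/service-perfdata /var/spool/pnp4nagios/npcd/service-perfdata.$TIMET$
          }
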
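Regarding the command-file settings mentioned further up, what I plan to try in icinga.cfg is roughly this (32768 is simply our current value; we will lower it once we see how high the "High" mark from icingastats actually goes):

  # icinga.cfg (excerpt)
  check_external_commands=1
  # -1 means "check as often as possible"; the "s" suffix makes this 5 seconds
  command_check_interval=5s
  # currently 32768 - the "Used/High/Total Command Buffers" line from
  # icingastats will show whether this can be reduced
  external_command_buffer_slots=32768
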
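And for the test without IDO, I assume it is just a matter of commenting out the broker_module line in icinga.cfg and reloading (the paths below are how it looks on our system):

  # icinga.cfg (excerpt) - temporarily don't load idomod
  #broker_module=/usr/local/icinga/lib/idomod.so config_file=/usr/local/icinga/etc/idomod.cfg
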
> The new version of ido2db, and it's FIFO, cures a lot of grief,
> but it may back up if the FIFO gets huge by virtue of a slow RDBMS
> engine. Is the database instance "yours" or under somebody else's
> control? Is it shared with other applications? (Note that this
> will be immediately apparent if you shut down idomod altogether and
> the system (using "Classic" web) begins to behave normally.)

The database is ours, MySQL. It runs on the same machine (8 cores, 12 GB of memory). It is shared with our web frontend for Icinga, with PNP4Nagios (which currently doesn't run) and with Cacti.

Thank you very much for the considerations! I will get back to the list when there are some results, so that others can profit from them as well :).

Denis