Hi, I've been asked this a few times, so I'll respond to the mailing list instead of to individual people.

The questions are typically:

* What machine do you run Nagios on?
* Any hints about monitoring this many machines?

So ..

What machine do you run Nagios on?
==================================

We use a dual-processor (single-core) 2.8GHz Xeon IBM blade server with 4GB of RAM (I thought it was 6GB - sorry about that, Patrick) for the renderwall monitoring. Typically, though, we only use about 3GB of the 4GB.

Reloading or restarting Nagios takes about 5 minutes with approximately 1500 hosts and 9000 services. During that time you can see a single CPU at 100%.

We run Debian GNU/Linux on this machine with a 2.6.15 kernel. I tried the 2.4 kernel but got rid of it early on while debugging. I'm currently running Nagios 2.3.

All services except the host pings are checked using passive checks. We check on various Weta-specific things related to rendering on each machine. It's all Perl run from cron.

Once Nagios is up and running, it is rare to see more than one CPU being used. In other words, the blade server is not sweating under the load. That one CPU will hit 100% about 1/4 of the time.

Any hints about monitoring that many machines?
==============================================

Do not process perfdata.
------------------------

This was a killer. I discovered to my surprise that Nagios will run /usr/bin/printf to record performance data. This meant that every time a passive check sent a result back to Nagios, it would start a subshell and run /usr/bin/printf to record some data. With 9000 services being checked every 5 minutes, that adds up to a lot of process launches.

If I needed the performance data, I think I would hack Nagios to recognise /usr/bin/printf in a check command and replace it with the library call. Obviously this would have to be done carefully to handle various shell things (like >> etc).

Spread out your passive checks.
-------------------------------

We run our passive checks every five minutes, which is perhaps a little aggressive.
We had to spread out the passive check start times by adding a random sleep before each run. At the moment the passive check scripts sleep for between 0 and 3 minutes before starting up. This spreads the load nicely. Adding more hosts will just be a matter of spreading these out further, until I have to change to checking every 10 minutes instead of every 5 minutes. For now, though, that's not necessary.

Caveats
=======

We do no notification from Nagios for the renderwall. The wranglers use the web pages to monitor the machines. (Obviously we notify for our production Nagios instance, but that's a much smaller problem.)

Hope this helps,

---------
Bill Ryder
System Engineer
Weta Digital.

_______________________________________________
Nagios-users mailing list
https://lists.sourceforge.net/lists/listinfo/nagios-users
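
For anyone who wants to try the random start delay described under "Spread out your passive checks", here is a minimal sketch. It is written in Python rather than the Perl used at Weta, and it is not the actual Weta code. The names pick_delay, run_after_delay and average_starts_per_second are hypothetical, and `check` is a stand-in for whatever the real check script does. How the result is then submitted to Nagios is left out, since the post does not describe it.

```python
import random
import time

MAX_DELAY_SECONDS = 180  # "between 0 and 3 minutes", as described in the post


def pick_delay(max_delay=MAX_DELAY_SECONDS, rng=random):
    """Return a uniformly distributed start delay in seconds."""
    return rng.uniform(0, max_delay)


def run_after_delay(check, max_delay=MAX_DELAY_SECONDS,
                    sleep=time.sleep, rng=random):
    """Sleep for a random interval, then run the check.

    `check` is a hypothetical callable standing in for the real check
    script. Returns the delay that was used and the check's result.
    """
    delay = pick_delay(max_delay, rng)
    sleep(delay)
    return delay, check()


def average_starts_per_second(hosts, max_delay=MAX_DELAY_SECONDS):
    """Average rate of script starts when hosts are spread over the window.

    Assumes one script start per host per cycle (an assumption, not
    something stated in the post).
    """
    return hosts / max_delay
```

A cron job on each machine could call run_after_delay(some_check) every five minutes. Under the one-start-per-host assumption, average_starts_per_second(1500) comes to roughly 8.3 starts per second over the 3-minute window, instead of all 1500 starting at the same moment. The `sleep` and `rng` parameters are only there so the delay can be tested without actually waiting.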
