I have ~25 computers worldwide and want to run several test commands on them every x minutes (where x is configurable), report the results, and be alerted when "bad things" happen.
Is this what Nagios/OpenNMS is designed for? Specifics below, but are Nagios/OpenNMS designed for network testing, or for more general multi-system status monitoring/reporting/alerting?

More specifically:

% Confirm each machine is up/pingable/reachable [obviously!]

% nmap each machine to make sure the correct ports (varies by machine) and no others are open

% "Egress" testing: machines can/can't reach port 25/80/443/etc (my choice for each machine) on public machines like www.yahoo.com, smtp.yahoo.com, etc.

% Config changes: not just hacking/cracking testing, but maybe I intentionally upgraded something and accidentally broke email between two machines, or broke Mailman auto-reply, or something like that.

% Not all tests all the time: some tests should run less frequently (to reduce the load); ideally, some "silly" tests run "randomly" to check on things I'm 100% sure will work all the time (but that really do fail sometimes, e.g., a broken loopback interface)

% For machines running httpd, download several pages, diff them against the last copies of these pages, and report "big" differences (I assume small diffs are routine changes, but big diffs may be hacking/defacing/config error/etc); for "status update" pages (like mrtg), pages *should* be changing frequently, otherwise something is wrong

% For machines running sendmail, send a test email to one of the other machines running sendmail, which then confirms receipt; alert if not received. Also do other mail routing/delivery tests.

% For machines running MySQL, run specific queries that are expected to return certain results (two basic types of queries: some that should NEVER return ANY results, others that should ALWAYS return AT LEAST ONE result)

% For machines running popd/imapd, simulate a login to confirm authentication is working (popd/imapd auth isn't always local for us)

% For machines running bind/tinydns, DNS testing: responding w/ correct IP addresses, and ACL control over who can look up which hosts (internal machines can look up yahoo.com; external machines can look up *.mycompany.com only)

% For machines running other daemons, do daemon-specific testing/verification

% Backup testing: our backup server should have fairly recent copies of the files it's backing up from other machines

% Replication testing: confirm that a config file/database/software version/etc is the same on two given machines

% Monitor files in /etc (e.g., passwd, shadow, crontab) for changes.

% Ping the other 24 machines + alert me if ping fails or is very slow

% Firewall rule testing: test which machines can reach which other machines on what ports, and compare to a known good list/table

% Cross-report the status of each machine to each other machine, even if nothing is wrong (so each machine knows how the other 24 are doing)

% Run things like "df -k/df -ik", "ps -aux -www", "top -n -d 1 infinity", "netstat -a", "vmstat", "mailq -v", "uptime", etc, and find memory-hogging/CPU-intensive processes, non-daemon processes that have been running for a long time, processes running w/o proper subprocesses, non-listening daemon processes (like ntpd) not running, near-full disk partitions, DoS attacks, many sockets in FIN_WAIT_[12] state, overfull mail queues, recent reboots, etc.

% Allow for special cases: run specific tests (e.g., a Perl script) on only 1-2 of the 25 machines

% Windows machines: ideally, run the equivalent of the commands above and also report failed scheduled tasks, near-full Exchange stores, and other Windows-specific issues

% Ideally, a lot of the above tests should run "out of the box", or Nagios/OpenNMS could run some sort of "discovery" program (find out which machines are running httpd, grab a few pages linked off the home page, and use those as the "test" pages from then on) and allow me to customize as necessary. Of course, I realize I'll have to configure things like custom SQL queries.

% Ideally, the testing should be "decentralized": any of the machines can test any of the others, and the results are stored in a distributed/mirrored way. However, the testing management is ideally "centralized" in the sense that I can control testing on all machines from a given machine.

% Ideally, the results (good or bad) can be displayed in a web page so my customers can see that my machines are being tested regularly, and are up and running fine as of x minutes ago.

% Ideally, the "something bad has happened" reporting can be configured: it may be OK for "mailq -v" to be large for 10-15 minutes, but not for 30 minutes (for example).

% Ideally, software-specific "regression" testing. E.g., when I upgrade Mailman, sendmail, etc, Nagios/OpenNMS could run a set of tests to make sure I didn't break something horribly.

-- 
We're just a Bunch Of Regular Guys, a collective group that's trying to understand and assimilate technology. We feel that resistance to new ideas and technology is unwise and ultimately futile.

_______________________________________________
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
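
To make two of the requirements above more concrete (the "NEVER / ALWAYS" SQL queries and the "large for 30 minutes" mailq rule), here is a rough Python sketch of the decision logic only. It assumes the usual Nagios plugin convention of exit codes 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN; the function names, the "never"/"always" labels, and the sample format are made up for illustration and are not part of Nagios or OpenNMS. Running the actual query or parsing mailq output is left out.

```python
# Exit codes used by the standard Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_query(row_count, expectation):
    """Map a query's row count to a (status, message) pair.

    expectation is a hypothetical label:
      "never"  -> the query must return no rows
      "always" -> the query must return at least one row
    """
    if expectation == "never":
        if row_count == 0:
            return OK, "OK - query returned no rows, as expected"
        return CRITICAL, f"CRITICAL - query returned {row_count} row(s), expected none"
    if expectation == "always":
        if row_count >= 1:
            return OK, f"OK - query returned {row_count} row(s)"
        return CRITICAL, "CRITICAL - query returned no rows, expected at least one"
    return UNKNOWN, f"UNKNOWN - unrecognized expectation {expectation!r}"


def sustained_breach(samples, limit, max_minutes):
    """Return True if the most recent samples have stayed above limit
    for at least max_minutes.

    samples is a list of (minute, value) pairs in increasing time order,
    e.g. the mail queue size recorded at each check.
    """
    breach_start = None
    for minute, value in samples:
        if value > limit:
            # Remember when the current run of over-limit samples began.
            if breach_start is None:
                breach_start = minute
        else:
            # Any sample at or below the limit resets the run.
            breach_start = None
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= max_minutes
```

A wrapper script would print the message and exit with the status code. A queue that has been over the limit for only 10 minutes would not alert with max_minutes=30, while one that has stayed over the limit for the full 30 minutes would.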