On Aug 20, 2006, at 8:49 AM, Patrick Mannion wrote:

> The environment is large - 10,000 Windows servers and 6,000
> Linux/Solaris/Tru64 servers (and a dozen VMS boxes) - a total of
> 120,000 managed objects in all, from CPUs to processes to filesystems
> and services, located around the world in seven main locations with
> connections from dark fiber to 256k leased lines.

At that size, it must surely be tempting to just purchase some other company's toolset, one with well-understood requirements for an environment that large. At least, that's the sort of decision many companies make, which probably explains Tivoli :-)

> I know that will mean a distributed Nagios architecture, but I'm not
> sure just how it should be done.

Well, think about the rate of alerts for a start. With that number of objects, what alert rate do you expect in normal operations? If you can't pump out alert messages fast enough, it's pointless monitoring the objects behind them.

Also, at that size, a metric telling you how many filesystems are over threshold is useless. Correlation is extremely important instead. What you need to concentrate on is business impact, which means that your monitoring zones should be designed around discrete per-project boundaries (well, as discrete as possible), with each zone capable of presenting an overview to the next layer up.

For example, if I want to know "is Oracle up" I don't need to know that one of a RAID set's disks is down. However, the hardware people need to know about the disks; they're not so concerned with the applications.

Split up your environment along project/responsibility lines, into as many small chunks as possible, and give each one its own monitoring solution. Instead of one Nagios with 120,000 objects, you'll hopefully end up with a hundred-odd Nagios installs with ~1000 objects in each. Each business unit can look at its own local/dedicated view, and can provide a calculated "business unit status" view to a central overview Nagios.

As far as Xen or VMware is concerned, do whatever you need in order to make your monitoring as available as possible; if the monitoring goes down, so does your knowledge of your own operations. A Xen object can be migrated between hardware platforms without significant downtime IIRC; I expect that VMware ESX can do the same.
Live migration lets you keep the monitors running while doing essential maintenance on their hardware; that ability is essential unless you have big iron (by your description you don't).

-jim

_______________________________________________
Nagios-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
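
To make the "calculated business unit status" idea from the reply a bit more concrete, here is a minimal Python sketch of how a local Nagios install might collapse its business-relevant checks into a single status for the central overview. Everything named here is an assumption for illustration: the `rollup` function, the unit and check names, and the severity ranking (CRITICAL worst, then UNKNOWN, then WARNING) are hypothetical and not part of Nagios itself. Only the numeric state codes 0 to 3 follow the usual plugin convention. How the result reaches the central server (for example as a passive check result) is left out.

```python
from enum import IntEnum
from typing import Iterable, List, Tuple


class State(IntEnum):
    """Plugin-style state codes (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN)."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


# Assumed ranking for the rollup: CRITICAL outranks UNKNOWN outranks WARNING.
# This ordering is a design choice for the sketch, not something Nagios mandates.
_SEVERITY = {
    State.OK: 0,
    State.WARNING: 1,
    State.UNKNOWN: 2,
    State.CRITICAL: 3,
}


def rollup(unit: str, checks: Iterable[Tuple[str, int]]) -> Tuple[State, str]:
    """Collapse (check_name, state) pairs into one status and summary line.

    Only checks that matter for business impact should be passed in; the
    hardware team's RAID checks would stay in their own local view.
    """
    results: List[Tuple[str, State]] = [(name, State(s)) for name, s in checks]
    if not results:
        # No data is not the same as "everything is fine".
        return State.UNKNOWN, f"{unit} UNKNOWN - no checks reported"

    worst = max((s for _, s in results), key=lambda s: _SEVERITY[s])
    failing = sorted(name for name, s in results if s != State.OK)

    summary = f"{unit} {worst.name} - {len(failing)} of {len(results)} checks not OK"
    if failing:
        summary += ": " + ", ".join(failing)
    return worst, summary


if __name__ == "__main__":
    # Hypothetical business unit with three business-relevant checks.
    state, line = rollup(
        "billing",
        [("oracle-up", State.OK), ("app-http", State.WARNING), ("batch-queue", State.OK)],
    )
    print(line)
    raise SystemExit(int(state))
```

The central overview then sees one object per business unit rather than ~1000 underlying objects, while the detail stays in the local install where the responsible team looks at it.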
