Thanks for all the interest and quick replies here. A bit more about what we're trying to do...
Remote site stats:
- There will be roughly 1200 remote locations (and that number could grow).
- Each location has a pair of servers.
- We have 2 servers per remote site because we're doing a high-availability / failover model.
- We're running mon on both servers in the remote site; each server is set up to watch key services on itself and on the other server. If a failure in a critical service/software component is detected and the server is the primary one, mon will send an alert telling that server to kill itself, and the second one will take over.
- Each server's mon instance is watching about 25 things, checking each one every minute.
- Every time any check is run on either mon server, we redistribute the result back to the corporate monitoring server.
- We need to tune the failover model to detect problems more quickly, so about 15 of the traps need to run every 10 seconds.
- ((15 traps * 6 runs/minute) + 10 other traps/minute) = 100 traps per server per minute; 100 traps/minute * 2 servers per location = 200 traps per minute sent to the corporate monitoring server.

Corporate site stats:
- Failures are not expected very often - maybe 1-2 server problems per day.
- In the case of a large network outage, we might see up to a couple of hundred servers go away. That would actually reduce load on the server, since we wouldn't be seeing traps from those locations.
- There shouldn't be much load generated by the corporate server acting on alerts: if a failure is detected by any monitor at any site, we're going to send an email once every 4 hours.

Load thoughts:
- From a network perspective, load would be somewhat significant: (200 bytes per trap * 200 traps/minute * 1200 sites * 8 bits/byte / 60 seconds) = 6.4 Mbps.
- The server would be responding to (200 traps/minute * 1200 sites / 60 seconds) = 4000 traps/second. That sounds like a whole lot to me.
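The arithmetic above checks out; here's a quick back-of-the-envelope script reproducing it (all figures taken straight from the estimates above, Python used just for illustration):

```python
# Back-of-the-envelope check of the trap-load figures estimated above.
FAST_TRAPS = 15        # traps whose checks run every 10 seconds (6x per minute)
SLOW_TRAPS = 10        # traps whose checks run once per minute
SERVERS_PER_SITE = 2
SITES = 1200
BYTES_PER_TRAP = 200   # assumed average trap size, per the estimate above

traps_per_server_per_min = FAST_TRAPS * 6 + SLOW_TRAPS                # 100
traps_per_site_per_min = traps_per_server_per_min * SERVERS_PER_SITE  # 200
traps_per_second_at_corporate = traps_per_site_per_min * SITES / 60   # 4000.0
mbps = traps_per_site_per_min * SITES * BYTES_PER_TRAP * 8 / 60 / 1_000_000

print(traps_per_server_per_min, traps_per_site_per_min,
      traps_per_second_at_corporate, mbps)
```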
Corporate requirements:
- Be aware of a service failure out at a remote site.
- Be aware of a network outage (i.e., no traps received within "x" minutes).
- A centralized view of uptime, what's currently down, and outage history.
- Outage history has to go down to the per-service / per-server level.

Thanks,
Tim

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jim Trocki
Sent: Thursday, August 24, 2006 9:15 AM
To: David Nolan
Cc: mon@linux.kernel.org
Subject: Re: Question on Redistribute

On Thu, 24 Aug 2006, David Nolan wrote:

> --On Thursday, August 24, 2006 08:21:16 -0500 Tim Carr <[EMAIL PROTECTED]>
>>
>> The problem is that we're going to need to turn the monitoring period
>> for several of the remote site monitors in each location way up - like
>> checking every 10 seconds (i.e., "interval 10s"). That means we're going
>> to see a huge increase in the number of traps we're seeing at the
>> corporate site.

> Or we could implement a redistributeevery option, similar to alertevery.
> That wouldn't be too hard, but would take a little work.

yeah, the issue here is the processing and communication overhead of dealing with the traps sent remotely. it would make sense to batch up the 10s traps from the remote systems and send them out in a bundle, say once every minute. that would save you 6x the processing overhead on the remote mon server, or at least give you a way to control the processing overhead to suit your needs.

this use case might mean that it would make sense to move the remote trap stuff into the mon server itself, rather than implement it with the trap alert. the trap alert is a nice simple abstraction that works well for the simpler cases, and an elegant way of extending the functionality of mon without having to change the server code, but at the cost of efficiency. you would really want the ability to batch up only the trap transmissions rather than all alerts.
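the batching idea is simple enough to sketch. this is a hypothetical illustration only (mon itself is Perl, and the `TrapQueue` name, `flush_interval`, and `send` callback are all made up for the example), but it shows the core logic: accumulate traps and transmit one bundle per interval instead of one packet per trap.

```python
import time

class TrapQueue:
    """Accumulate trap messages and flush them as one bundle per interval.

    Hypothetical sketch of the batching / 'redistributeevery' idea discussed
    above; not actual mon code.
    """
    def __init__(self, flush_interval=60.0, send=None):
        self.flush_interval = flush_interval          # seconds between bundles
        self.send = send or (lambda batch: None)      # transmit one bundle
        self.pending = []
        self.last_flush = time.monotonic()

    def enqueue(self, trap):
        self.pending.append(trap)
        self.maybe_flush()

    def maybe_flush(self, now=None):
        """Send everything queued if the flush interval has elapsed."""
        now = time.monotonic() if now is None else now
        if now - self.last_flush >= self.flush_interval and self.pending:
            self.send(list(self.pending))  # one network transmission per bundle
            self.pending.clear()
            self.last_flush = now
```

with a 60-second flush and checks every 10 seconds, each remote server sends one bundle of ~100 traps per minute instead of ~100 individual transmissions, which is where the ~6x saving in per-message overhead comes from.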
for example, schedule a "trap queue" flush every minute, performed by the mon server rather than by the trap alert.

then this brings up the issue of trap processing overhead on the rx end. i wonder if the behavior would be acceptable just processing the trap receptions serially, the way it is done now, or if it would require a change in processing method to scale it up efficiently. this probably requires much more thought and a better understanding of the usage scenario.

_______________________________________________
mon mailing list
mon@linux.kernel.org
http://linux.kernel.org/mailman/listinfo/mon
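if serial processing on the rx end turns out not to keep up with ~4000 traps/second, one alternative processing method is to decouple reception from handling with a worker pool. this is a hypothetical sketch only (the `serve` function and its parameters are invented for the example; it is not how mon actually processes traps):

```python
import queue
import threading

def serve(rx_queue, handle, workers=4):
    """Drain received traps with a pool of worker threads.

    Hypothetical sketch of one way to scale up rx-side processing if
    serial handling proves too slow; not actual mon code. Push None
    onto rx_queue once per worker to shut the pool down.
    """
    def worker():
        while True:
            trap = rx_queue.get()
            if trap is None:           # shutdown sentinel
                rx_queue.task_done()
                return
            handle(trap)               # per-trap processing runs off the rx path
            rx_queue.task_done()

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(workers)]
    for t in threads:
        t.start()
    return threads
```

whether this is actually needed depends on how expensive each trap is to process; serial handling may well be fine, which is the open question in the paragraph above.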