Thanks for all the interest and quick replies here.  A bit more about
what we're trying to do...

Remote site stats:
- There are going to be roughly 1200 remote locations (which could
grow).
- Each location has a pair of servers, since we're running a
high-availability/failover model.
- We run mon on both servers at each remote site; each server is set up
to watch key services on itself and on its peer. If a failure in a
critical service/software component is detected on the primary server,
mon sends an alert telling that server to kill itself, and the secondary
kicks in.
- Each server's mon instance watches about 25 things, checking each one
every minute.
- Every time a check runs on either mon server, the result is
redistributed back to the corporate monitoring server.
- We need to tune the failover model to detect problems more quickly, so
about 15 of the checks need to run every 10 seconds.
- ((15 fast checks * 6 runs/minute) + 10 other traps/minute) = 100 traps
per server per minute; at 2 servers per location, that's 200 traps per
minute per location sent to the corporate monitoring server.
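The per-location arithmetic above can be sanity-checked with a short
script (all numbers come from the figures in this message; none of this
is mon code):

```python
# Trap volume per location, using the figures quoted above.
fast_checks = 15               # checks moved to a 10-second interval
fast_traps = fast_checks * 6   # each runs 6 times/minute -> 90 traps/min
other_traps = 10               # remaining once-a-minute checks
per_server = fast_traps + other_traps   # 100 traps per server per minute
per_location = per_server * 2           # 2 servers -> 200 traps/minute
print(per_server, per_location)         # 100 200
```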

Corporate site stats:
- Failures are not expected very often - maybe 1-2 server problems per
day
- In the case of a large network outage, we might see up to a couple of
hundred servers go away. That would actually reduce load on the server,
since we wouldn't be seeing traps from those locations.
- There shouldn't be much load generated by the corporate server acting
on alerts. If a failure is detected for any monitor at any site, we send
an email once every 4 hours.
 
Load thoughts:
- From a network perspective, load would be somewhat significant: (200
bytes per trap * 200 traps/minute * 1200 sites * 8 bits/byte / 60
seconds) = 6.4 Mbps.
- The server would be responding to (200 traps/minute * 1200 sites / 60
seconds) = 4,000 traps/second. That sounds like a whole lot to me.
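Those two load estimates work out as follows (the 200-byte average trap
size is the assumption stated above):

```python
TRAP_BYTES = 200              # assumed average size of one trap on the wire
TRAPS_PER_MIN_PER_SITE = 200
SITES = 1200

# Aggregate inbound bandwidth at the corporate site.
bits_per_sec = TRAP_BYTES * 8 * TRAPS_PER_MIN_PER_SITE * SITES / 60
print(bits_per_sec / 1e6)     # 6.4 (Mbps)

# Aggregate trap arrival rate at the corporate server.
traps_per_sec = TRAPS_PER_MIN_PER_SITE * SITES / 60
print(traps_per_sec)          # 4000.0
```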

Corporate requirements:
 - Be aware when a service fails at a remote site
 - Be aware of a network outage (i.e., no traps received within "x"
minutes)
 - A centralized view of uptime, what's currently down, and outage
history
 - Outage history has to go down to the per-service/per-server level
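For the "no traps received within x minutes" requirement, a dead-man
timer on the corporate side is one way to do it. This is only a sketch
of the idea, not existing mon functionality; record_trap, stale_sites,
and the 5-minute window are all made up for illustration:

```python
import time

WINDOW = 5 * 60   # "x" minutes; 5 chosen arbitrarily for the example

last_seen = {}    # site id -> timestamp of the most recent trap

def record_trap(site, now=None):
    """Note the arrival time of a trap from a site."""
    last_seen[site] = time.time() if now is None else now

def stale_sites(now=None):
    """Sites silent for longer than WINDOW -> suspected network outage."""
    now = time.time() if now is None else now
    return [s for s, t in last_seen.items() if now - t > WINDOW]

record_trap("site-001", now=1000.0)
record_trap("site-002", now=1290.0)
print(stale_sites(now=1350.0))   # ['site-001']
```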



Thanks,
Tim

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of Jim Trocki
Sent: Thursday, August 24, 2006 9:15 AM
To: David Nolan
Cc: mon@linux.kernel.org
Subject: Re: Question on Redistribute

On Thu, 24 Aug 2006, David Nolan wrote:

> --On Thursday, August 24, 2006 08:21:16 -0500 Tim Carr
> <[EMAIL PROTECTED]>
>>
>> The problem is that we're going to need to turn the monitoring period
>> for several of the remote site monitors in each location way up --
>> like checking every 10 seconds (i.e., "interval 10s"). That means
>> we're going to see a huge increase in the number of traps we're
>> seeing at the corporate site.

> Or we could implement a redistributeevery option, similar to
> alertevery. That wouldn't be too hard, but would take a little work.

yeah, the issue here is the processing and communication overhead of
dealing with the traps sent remotely. it would make sense to batch up
the 10s traps from the remote systems and send them out in a bundle,
say once every minute. that would save you 6x the processing overhead
on the remote mon server, or at least give you a way to control the
processing overhead to suit your needs.

this use case might mean that it would make sense to move the remote
trap stuff into the mon server itself, rather than implement it with
the trap alert. the trap alert is a nice, simple abstraction that works
well for the simpler cases, and an elegant way of extending mon's
functionality without having to change the server code, but at the cost
of efficiency. you would really want the ability to batch up only the
trap transmissions rather than all alerts. for example, schedule a
"trap queue" flush every minute, performed by the mon server rather
than in the trap alert.
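The "trap queue" described here might look something like the sketch
below. To be clear, mon has no such feature today; TrapQueue and
send_bundle are hypothetical names illustrating the batching idea:

```python
import time

FLUSH_INTERVAL = 60   # seconds between bundled transmissions

class TrapQueue:
    """Buffer trap payloads locally; ship them upstream in one bundle."""

    def __init__(self, send_bundle):
        self.send_bundle = send_bundle   # transport callback (hypothetical)
        self.pending = []
        self.last_flush = time.time()

    def enqueue(self, trap):
        self.pending.append(trap)

    def maybe_flush(self, now=None):
        now = time.time() if now is None else now
        if self.pending and now - self.last_flush >= FLUSH_INTERVAL:
            self.send_bundle(self.pending)   # one bundle instead of N traps
            self.pending = []
            self.last_flush = now

# Six 10-second check results accumulate, then go out as a single bundle.
sent = []
q = TrapQueue(sent.append)
q.last_flush = 0.0
for i in range(6):
    q.enqueue({"check": "svc", "seq": i})
q.maybe_flush(now=61.0)
print(len(sent), len(sent[0]))   # 1 6
```

This is the 6x saving mentioned above: six per-check transmissions
collapse into one bundle per flush interval.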

then this brings up the issue of trap processing overhead on the rx
end. i wonder if the behavior would be acceptable just by processing
the trap receptions serially, the way it is done now, or if it would
require a change in processing method to scale up efficiently.
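One way to frame the serial-processing question: at the aggregate rate
estimated earlier in the thread, a single-threaded receiver has a fixed
time budget per trap:

```python
traps_per_sec = 200 * 1200 / 60   # from the load estimate above
budget_us = 1e6 / traps_per_sec   # microseconds available per trap
print(budget_us)                  # 250.0
```

If parsing and state update fit comfortably inside that budget, serial
processing may hold up; if not, batching or parallel reception would be
needed.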

this probably requires much more thought and a better understanding of
the usage scenario.

_______________________________________________
mon mailing list
mon@linux.kernel.org
http://linux.kernel.org/mailman/listinfo/mon
