On Fri, 22 Jul 2011, Tom Limoncelli wrote:

Part of the problem is that there are four ponies here, not one.


  - Historical monitoring: Gathering statistics via SNMP or similar,
  storing them, and drawing pretty graphs.
  - Real-time monitoring: ping and other "is it up/down?" queries.

These two things are so different that I rarely see software that can do
both very well.  Real-time should keep the last n minutes of results in RAM
for fast calculations.  Historical monitoring should stash things on disk
and move on.
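A minimal sketch of what "keep the last n minutes in RAM" might look like. The class name, the window size, and the up/down representation are my own assumptions for illustration, not anything from a specific tool:

```python
import time
from collections import deque

class RecentResults:
    """Keep only the last `window` seconds of check results in RAM."""

    def __init__(self, window=300):  # hypothetical 5-minute window
        self.window = window
        self.results = deque()  # (timestamp, is_up) pairs, oldest first

    def record(self, is_up, now=None):
        now = time.time() if now is None else now
        self.results.append((now, is_up))
        # discard anything older than the window
        while self.results and self.results[0][0] < now - self.window:
            self.results.popleft()

    def availability(self):
        """Fraction of recent checks that were up, or None if no data."""
        if not self.results:
            return None
        return sum(1 for _, up in self.results if up) / len(self.results)
```

Because the window is bounded, "what fraction of checks succeeded in the last five minutes?" is an in-memory scan rather than a disk query.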

There are at least two more components:

  - Alerting: Say you know something is "wrong"; the alerting system has to
  decide who to contact (based on a pager rotation schedule, etc.), how to
  contact them (email or pager depending on time of day, urgency, and so on),
  and implement the escalation policy.

Alerting is made even more complex by the fact that you really want to be able to alert on things that your applications and systems log, not just on what your monitoring probes return.

Logging and alerting really do overlap a lot, but I don't know of any tools that take advantage of this rather than trying to partition the two.

I've come to the conclusion that the best way to do alerting is to get all the logs into a syslog stream to a central server farm and have an alerting engine watch that (Simple Event Correlator, SEC, is a good starting point).
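As a rough illustration of the kind of correlation SEC does over a syslog stream, the toy sketch below (not SEC's rule language; the threshold and window numbers are invented) fires only when the same check fails repeatedly within a window, instead of paging on every single line:

```python
import time
from collections import defaultdict, deque

class Correlator:
    """Toy event correlator: alert when the same (host, check) pair
    fails `threshold` times within `window` seconds."""

    def __init__(self, threshold=3, window=60):
        self.threshold = threshold
        self.window = window
        self.failures = defaultdict(deque)  # (host, check) -> failure timestamps

    def feed(self, host, check, is_failure, now=None):
        """Feed one parsed log event; return True when an alert should fire."""
        now = time.time() if now is None else now
        key = (host, check)
        if not is_failure:
            self.failures[key].clear()  # a success resets the streak
            return False
        q = self.failures[key]
        q.append(now)
        while q and q[0] < now - self.window:  # expire old failures
            q.popleft()
        return len(q) >= self.threshold
```

The point is the shape of the engine: it consumes a stream of parsed log lines and emits far fewer alert decisions, with all the flap-damping state held in one place.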

The monitoring system should look for things of interest and generate log entries to pass on to the alerting system.
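Concretely, "generate log entries" could be as simple as emitting one structured line per observation into the syslog stream. The MONITOR prefix and key=value layout here are my own convention, chosen so a downstream correlator can match them with a simple regex; the shipping helper uses Python's standard SysLogHandler:

```python
import logging
import logging.handlers

def monitoring_event(host, check, state, detail):
    """Render one monitoring observation as a single structured log line
    that a downstream correlator (e.g. SEC) can match easily."""
    # key=value style keeps the correlator's regexes simple
    return "MONITOR host=%s check=%s state=%s detail=%r" % (
        host, check, state, detail)

def make_syslog_logger(address=("localhost", 514)):
    """Hypothetical helper: ship events to the central syslog farm
    over UDP using the stdlib syslog handler."""
    logger = logging.getLogger("monitor")
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.SysLogHandler(address=address))
    return logger

# e.g.:
# make_syslog_logger().info(
#     monitoring_event("web01", "http", "CRITICAL", "timeout"))
```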

Trying to do everything in one system will run you into a lot of problems.

  - Graphing/dashboard: The system that draws the dashboards and pretty
  graphs mentioned above.

It would be nice if we had well-defined interfaces between these components
so that we could mix and match.

And I think this is the key to it all.

Right now in my company the situation is "you aren't in the monitoring group, so your opinion doesn't matter. Besides, we've just spent $big_bucks to buy $professional_tool, which will solve all monitoring issues." But if I were able to work on this, I would do something along the following lines.

Note that when I say 'system', this could be a process, a server, or a farm of servers, depending on your scale.

Set up one system with high-performance disks running MRTG to collect and store the historical data.

Set up a second system with something like Nagios receiving passive checks, but modify the passive-check receiver to push a copy of its data into MRTG. When MRTG sees something 'interesting', log it.
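For reference, Nagios accepts passive service results as external commands appended to its command file, in the format sketched below. The tee helper is the hypothetical modification described above: the command-file path and the trend_sink callback are placeholders for wherever your trending system takes input:

```python
import time

# Nagios external-command syntax for passive service check results:
#   [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;plugin_output
# where return_code is 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.

def passive_check_line(host, service, code, output, when=None):
    """Format one passive service check result as a Nagios external command."""
    when = int(time.time()) if when is None else int(when)
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s" % (
        when, host, service, code, output)

def submit_and_tee(line, cmd_file, trend_sink):
    """Hypothetical modified receiver: append the result to the Nagios
    command file as usual, and also push a copy toward the trending system."""
    with open(cmd_file, "a") as f:
        f.write(line + "\n")
    trend_sink(line)  # e.g. a callable that feeds the MRTG/RRD side
```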

Set up a third system with something like SEC watching the logs (both from Nagios and from other log sources) to do the alerting.

Set up a fourth system with something along the lines of Splunk for ad-hoc queries of the logs.

Set up a fifth system to generate periodic reports from the data.

Set up a sixth system to generate real-time dashboards from the data.


Nagios would do the up/down checks, dependency resolution, etc. (so that one router going down doesn't generate 1000 alerts from all the services on all the servers on the other side of the router), although it may be that that belongs in the alerting-engine stage of things.
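The router example can be sketched as a walk up a parent map. The host names and topology below are invented; Nagios expresses the same relationship with its `parents` directive, and marks hosts behind a down parent UNREACHABLE rather than DOWN:

```python
# Hypothetical parent/child map: each host's network parent,
# as Nagios models it with the "parents" directive.
PARENTS = {
    "router1": None,       # top of the topology, no parent
    "web01": "router1",
    "web02": "router1",
    "db01": "router1",
}

def should_alert(host, down_hosts, parents=PARENTS):
    """Alert only for the topmost unreachable host: if any ancestor is
    also down, this host is merely unreachable, not the root cause."""
    p = parents.get(host)
    while p is not None:
        if p in down_hosts:
            return False   # suppressed: blame the parent instead
        p = parents.get(p)
    return True
```

Whether this suppression lives in the poller (as in Nagios) or in the alerting engine is exactly the open question above; the logic is the same either way.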

David Lang

Tom

P.S.  Has anyone tried http://opentsdb.net/ ?  It looks very interesting.
_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/