Re: [lopsa-discuss] Monitoring Sucks!

david Mon, 25 Jul 2011 21:01:19 -0700

Ok, it was a meeting on IRC

even worse, as it means that the only people who can participate are the ones who can arrange to be available at the right time.


David Lang

On Mon, 25 Jul 2011, [email protected] wrote:

Date: Mon, 25 Jul 2011 20:56:51 -0700 (PDT)
From: [email protected]
To: Christopher R Webber <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: [lopsa-discuss] Monitoring Sucks!
sorry, going and visiting various blogs/forums on a regular basis to participate in discussions there just isn't practical for me (there are already more blogs/forums I want to follow than I can keep up with).
I'll take a look and see if I can participate via e-mail (something very few forum packages support), but otherwise web forums are just too cumbersome to deal with.
David Lang

On Mon, 25 Jul 2011, Christopher R Webber wrote:
Date: Mon, 25 Jul 2011 04:23:08 +0000
From: Christopher R Webber <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: [lopsa-discuss] Monitoring Sucks!
Really, this is why people should be participating in the #monitoringsucks discussion. The goal is to work as a community to come up with a few standard ideas that we can all build on. Many of us only have need for parts of the stack, others need a stack to do very different things. If we can work together to come up with how these things come together, we can all start contributing to the solution instead of bitching about how much the state of #monitoringsucks.
-- cwebber

Christopher Webber
Computing Infrastructure and Security
University of California, Riverside


On Jul 24, 2011, at 4:34 PM, <[email protected]>
<[email protected]> wrote:
On Fri, 22 Jul 2011, Tom Limoncelli wrote:
Part of the problem is that there are four ponies here not one.


 - Historical monitoring: Gathering statistics via SNMP or similar,
 storing them, and drawing pretty graphs.
 - Real-time monitoring: ping and other "is it up/down?" queries.

These two things are so different that I rarely see software that can do
both very well. Real-time should keep the last n-minutes of results in RAM
for fast calculations.  Historical monitoring should stash things on disk
and move on.
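
A minimal sketch of that split, assuming Python and hypothetical names (`RealTimeWindow`, `append_historical`, and the JSON-lines file layout are illustrations, not taken from any tool mentioned in this thread). The real-time side keeps only a sliding window of samples in RAM; the historical side appends each sample to disk and returns.

```python
import json
from collections import deque


class RealTimeWindow:
    """Keep only the last `window_seconds` of samples in RAM (hypothetical helper)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        self.samples.append((timestamp, value))
        # Drop everything that has fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()

    def average(self):
        """Fast in-memory calculation over the current window."""
        if not self.samples:
            return None
        return sum(value for _, value in self.samples) / len(self.samples)


def append_historical(path, timestamp, metric, value):
    """Stash one sample on disk as a JSON line and move on (assumed file format)."""
    record = {"ts": timestamp, "metric": metric, "value": value}
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```

The same sample can be handed to both: `window.add(ts, v)` for the up/down decision, `append_historical(path, ts, name, v)` for later graphing.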

There are at least two more components:
 - Alerting: Say you know something is "wrong"; the alerting system has to
 decide who to contact (based on a pager rotation schedule, etc.) and how to
 contact them (email or pager depending on ToD, urgency, and so on), and
 implement the escalation policy.
Alerting is made even more complex by the fact that you really want to be able to alert on things that your applications and systems log, not just on what your monitoring probes return.
logging and alerting really do overlap a lot, but I don't know any tools that take advantage of this rather than trying to partition it.
I've come to the conclusion that the best way to do alerting is to get all the logs into a syslog stream to a central server farm and have an alerting engine watch that (Simple Event Correlator, SEC, is a good starting point).
the monitoring system should look for things and generate log entries to pass on to the alerting system.
trying to do everything in one system will run you into a lot of problems.
 - Graphing/dashboard: The system that draws the dashboards and pretty
 graphs mentioned above.
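
The alerting ideas above (an engine that watches a central log stream, then picks who to contact from a rotation) could be sketched roughly as follows. This is only an illustration in Python: `LogAlertRule`, `route`, the suppression window, and the "pager if urgent, else email" policy are all assumptions, and none of it is how SEC or any pager system is actually configured.

```python
import re


class LogAlertRule:
    """Match log lines against a pattern and suppress repeats (hypothetical)."""

    def __init__(self, name, pattern, suppress_seconds):
        self.name = name
        self.pattern = re.compile(pattern)
        self.suppress_seconds = suppress_seconds
        self.last_fired = None  # timestamp of the last alert raised

    def check(self, timestamp, line):
        """Return an alert string, or None if no match or still suppressed."""
        if not self.pattern.search(line):
            return None
        if (self.last_fired is not None
                and timestamp - self.last_fired < self.suppress_seconds):
            return None
        self.last_fired = timestamp
        return f"ALERT {self.name}: {line}"


def route(rotation, week, urgent):
    """Pick the on-call person for this week and a contact method.

    Assumed policy: weekly rotation through `rotation`; pager when urgent,
    email otherwise.
    """
    person = rotation[week % len(rotation)]
    channel = "pager" if urgent else "email"
    return person, channel
```

Because the rule only sees text lines, it does not care whether a line came from a monitoring probe or from an application's own logging.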
It would be nice if we had well-defined interfaces between these components
so that we could mix and match.
and I think this is the key to it all.
right now in my company we have the situation "you aren't in the monitoring group, so your opinion doesn't matter. Besides, we've just spent $big_bucks to buy $professional_tool, that will solve all monitoring issues", but if I were able to work on this, I would do something along the following lines:
note that when I say 'system' this could be a process, a server, or a farm of servers depending on your scale
set up one system with high performance disks running the MRTG network service
set up a second system with something like Nagios receiving passive checks, but modify the passive check receiver to push a copy of its data into MRTG. When MRTG sees something 'interesting', log it.
set up a third system with something like SEC to watch logs (both from Nagios and from other log data) to do the alerting
set up a fourth system with something along the lines of splunk for ad-hoc queries of the logs
set up a fifth system to generate periodic reports from the data
set up a sixth system to generate real-time dashboards from the data
Nagios would do the up/down checks, dependency resolution, etc. (so that one router going down doesn't generate 1000 alerts from all the services on all the servers on the other side of the router), although it may be that that belongs in the alerting engine stage of things.
David Lang
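
The dependency resolution mentioned above (one router down should not produce an alert for everything behind it) can be illustrated with a small sketch. This is Python with a hypothetical `root_causes` helper and an assumed child-to-parent map; it is not Nagios's actual implementation, just the general idea of alerting only on the topmost failed node.

```python
def root_causes(down, parents):
    """Return the down nodes that have no down ancestor.

    `down` is a set of node names currently failing; `parents` maps each
    node to the single node it depends on (assumed tree-shaped topology).
    """
    roots = set()
    for node in down:
        ancestor = parents.get(node)
        seen = set()  # guards against accidental cycles in the map
        masked = False
        while ancestor is not None and ancestor not in seen:
            if ancestor in down:
                masked = True  # something upstream already explains this
                break
            seen.add(ancestor)
            ancestor = parents.get(ancestor)
        if not masked:
            roots.add(node)
    return roots
```

For example, with `web1` and `web2` both behind `router1`, a `down` set containing all three yields only `router1`, so a single alert goes out instead of three.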
Tom

P.S.  Has anyone tried http://opentsdb.net/ ?  It looks very interesting.
_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/