Re: [lopsa-discuss] Monitoring Sucks!

Christophe Kalt Thu, 04 Aug 2011 17:26:32 -0700

Tom Limoncelli wrote:

> Part of the problem is that there are four ponies here not one.
>


four is an optimistic count..

>
>    - Historical monitoring: Gathering statistics via SNMP or similar,
>    storing them, and drawing pretty graphs.
>    - Real-time monitoring: ping and other "is it up/down?" queries.
>
> These two things are so different that I rarely see software that can do
> both very well.
>

But they aren't, really.  Stats/metrics are what should drive the real-time
monitoring; they are far more interesting and richer than simple boolean
checks (although there are always some of those as well).
Now, I agree that the real-time analysis of these metrics and their
long-term storage are two different things.


>  Real-time should keep the last n-minutes of results in RAM for
> fast calculations.  Historical monitoring should stash things on disk and
> move on.
>

So maybe you agree, and just overstated how different these really are.
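
To make the split concrete, here is a minimal sketch, assuming a single
numeric metric and a simple "mean over the window exceeds a threshold"
rule.  The names (RollingWindow, record, breaches) are made up for
illustration and don't come from any particular tool.  Every sample is
appended to an archive and forgotten, while only the last n seconds stay
in RAM to drive the real-time check.

```python
import time
from collections import deque


class RollingWindow:
    """Keep only the last `window_seconds` of samples for one metric in RAM."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def add(self, value, now=None):
        now = time.time() if now is None else now
        self.samples.append((now, value))
        # Drop anything older than the window.
        cutoff = now - self.window_seconds
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()

    def mean(self):
        if not self.samples:
            return None
        return sum(v for _, v in self.samples) / len(self.samples)


def record(window, archive, value, now):
    """Historical side: append to `archive` (any file-like object) and move on.
    Real-time side: feed the same sample into the in-memory window."""
    archive.write(f"{now},{value}\n")
    window.add(value, now=now)


def breaches(window, threshold):
    """Metric-driven check (assumed rule): is the windowed mean too high?"""
    avg = window.mean()
    return avg is not None and avg > threshold
```

The same stream of samples feeds both sides; only retention and what you
compute on it differ.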

Real-time monitoring also needs to include a capacity management component,
so you can tell before the end of a cycle (whatever that is for you) whether
you're going to run out of capacity.
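
As a rough sketch of what I mean, assuming usage grows more or less
linearly over the cycle (a big assumption), you can fit a straight line to
recent samples and see where it crosses your capacity.  The function names
here are invented for the example.

```python
def time_to_exhaustion(samples, capacity):
    """samples: list of (time, used) pairs, oldest first.

    Fits a least-squares line through the samples and returns how long after
    the last sample the line reaches `capacity`.  Returns None if there is
    too little data or usage is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - samples[-1][0])


def runs_out_before(samples, capacity, cycle_end):
    """True if the fitted trend hits capacity at or before `cycle_end`."""
    eta = time_to_exhaustion(samples, capacity)
    return eta is not None and samples[-1][0] + eta <= cycle_end
```

Real growth is rarely this tidy, so treat the answer as an early warning
rather than a forecast.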


> There are at least two more components:
>
>    - Alerting: Say you know something is "wrong", the alerting system has
>    to decide who to contact (based on a pager rotation schedule, etc.) and how
>    to contact them (email or pager depending on ToD, urgency, and so on), and
>    implements the escalation policy.
>

Whoa, whoa, whoa..
That's just one of many ways of doing "alerting"; dashboards are commonly
used too.
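
For reference, the pager-style policy Tom describes boils down to something
like the sketch below.  Everything in it is an assumption made up for
illustration: the names in the rotation, the weekly hand-off, the
9-to-5 weekday business hours, the three urgency levels and the 15-minute
escalation step.

```python
from datetime import datetime

ROTATION = ["alice", "bob", "carol"]  # hypothetical on-call rotation
ESCALATE_AFTER_MINUTES = 15           # assumed escalation step


def on_call(now, rotation=ROTATION):
    """Weekly rotation: pick the primary from the ISO week number."""
    week = now.isocalendar()[1]
    return rotation[week % len(rotation)]


def channel(now, urgency):
    """How to contact: depends on urgency and time of day (assumed policy)."""
    business_hours = now.weekday() < 5 and 9 <= now.hour < 17
    if urgency == "critical":
        return "pager"
    if urgency == "major":
        # Email is read promptly during the day; page outside of it.
        return "email" if business_hours else "pager"
    return "email"


def route(now, urgency, unacked_minutes=0, rotation=ROTATION):
    """Who and how: every unacknowledged step moves to the next person."""
    level = unacked_minutes // ESCALATE_AFTER_MINUTES
    start = rotation.index(on_call(now, rotation))
    person = rotation[(start + level) % len(rotation)]
    return person, channel(now, urgency)
```

For example, route(datetime(2011, 8, 4, 23, 0), "major", unacked_minutes=20)
pages whoever comes after that week's primary.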

But really, everyone's missed the most important thing about monitoring.
Say you have got the system of your dreams set up and working.  That's
great, but useless by itself.
What matters most is the workflow surrounding alerts: how you deal with
alerts that are actionable, those that aren't, the ones that are long-term
issues, and so on.
Integration with outage tracking systems, ticketing systems, hardware &
software provisioning, software development processes and so on matters a
whole lot more when it comes to the long term success of a monitoring
implementation.

>
>    - Graphing/dashboard: The system that draws the dashboards and pretty
>    graphs mentioned above.
>

That's nice, but only useful in a manual way of doing things.
What you want in addition are things like

   - trending analysis, and reporting
   - rich logging infrastructure (not for real-time monitoring/alerting, but
   as supplement)
   - lots of people are fond of (re)active monitoring, where certain alerts
   trigger automated actions.  (Yikes!  I've yet to see a good use case for
   this.)

> It would be nice if we had well-defined interfaces between these components
> so that we could mix and match.
>

It sounds nice, but my guess is that such an approach would be overly
complex and ultimately doomed.

There are some hard design choices which I suspect need to be made early,
based on key requirements which will vary from one environment to another.
And the hardest question of all: will whatever you build scale?

Finally, to all those clamoring for a dream solution..  The hardest part
isn't the monitoring infrastructure itself, but integrating it into your
environment.  That is extremely time-consuming and expensive, and requires
a huge commitment from the business.

_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/
