$work uses an interesting paradigm that I didn't really understand when I got here, but which I now love.

Quick glossary: I'll say "performance monitoring" is collecting data over time, and optionally making pretty graphs; "alerting" is noting some state being unsatisfactory and telling people about it.

We don't have our alerting system actually run any commands on remote servers itself. Rather, if there's something we want to alert on, we write a script or other component that runs on the machine and reports data about it to the performance monitoring system. Then we close the loop by defining our alerting checks as "tell somebody if this graph goes to zero" (or whatever conditions you can dream up).
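To make that concrete, here's a minimal sketch of the reporting half, assuming a Graphite-style plaintext listener; the hostname, port, metric name, and "myworker" process name are placeholders rather than our real setup:

    #!/usr/bin/env python3
    # Local check that reports a metric instead of alerting directly.
    # Assumes a Graphite-style plaintext listener; host, port, metric
    # name, and the "myworker" process name are all hypothetical.
    import socket
    import subprocess
    import time

    GRAPHITE_HOST = "graphite.example.com"  # placeholder collector
    GRAPHITE_PORT = 2003                    # Graphite plaintext port

    def report_metric(name, value):
        """Send one 'name value timestamp' line to the collector."""
        line = "%s %s %d\n" % (name, value, int(time.time()))
        with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT),
                                      timeout=5) as sock:
            sock.sendall(line.encode("ascii"))

    if __name__ == "__main__":
        # Count running worker processes; the alerting side fires if
        # this series ever drops to zero.
        out = subprocess.run(["pgrep", "-c", "myworker"],
                             capture_output=True, text=True).stdout
        report_metric("servers.web01.myworker.process_count", int(out or 0))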

This has several benefits:
* Security is much simpler; we don't have to give any external service the permissions to run (often root-requiring) checks on our servers.
* Networking is similarly much simpler; we don't have to permit incoming connections from the alerting service.
* Overhead on the alerting hosts is lower, since all they're doing is (essentially) looking at graphs, rather than running commands which may block or time out.
* We get detailed performance data about whatever we're monitoring, which can help guide the alerting thresholds in the future.
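That last point is also why the alerting side stays trivial. Here's a sketch of a check that only ever looks at the graph, polling a (hypothetical) Graphite render endpoint and firing if the series goes to zero or stops reporting; the URL, target, and print-as-pager stand-in are assumptions, not our actual setup:

    #!/usr/bin/env python3
    # Alerting-side sketch: watch the graph, not the host. Assumes a
    # Graphite render API; the URL and target are placeholders, and
    # print() stands in for whatever actually pages a human.
    import json
    import urllib.request

    RENDER_URL = ("http://graphite.example.com/render"
                  "?target=servers.web01.myworker.process_count"
                  "&from=-10min&format=json")

    def latest_value(url):
        """Return the most recent non-null datapoint, or None."""
        with urllib.request.urlopen(url, timeout=10) as resp:
            series = json.load(resp)
        if not series:
            return None
        points = [v for v, _ts in series[0]["datapoints"] if v is not None]
        return points[-1] if points else None

    value = latest_value(RENDER_URL)
    if value is None or value == 0:
        # "Tell somebody if this graph goes to zero."
        print("ALERT: process_count is %r" % value)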

I was very resistant to this idea when I first arrived, since it was unlike anything I'd ever seen before. There are still some downsides, particularly added latency for real-time alerting (since metrics have to flow through the performance monitoring system first), but I think the upsides outweigh them by a large margin. Riemann seems to be a similar concept, although in our case we have our performance monitoring system sitting in the middle where Riemann places itself.

- Adam Compton



On 1/13/14 8:04 AM, Matthew Barr wrote:
So, I've recently been reading up on the #monitoringsucks tag, the
responses, and some of the various things that have come out of it.
I'm in a new shop, AWS-based, so many of the old standbys aren't quite
as obvious a call anymore.

What I’m now trying to figure out is what I’m missing, or would lose, by going 
with a newer paradigm for monitoring.


Anyone using Riemann yet? Do you still use Nagios / Sensu / etc.?

- Basically, Riemann operates on a stream of metrics, vs. relying on a
check every X minutes.

I'm trying to determine what I've lost by not implementing a Nagios-style
system that basically crons checks. (The alerting & state stuff I'm pretty
confident I'm not losing.)


For example: I had initially thought I'd lose a check of the web site every
X minutes, but the load balancer does that anyway, and that generates logs
and metrics about page response times.

I think that as you scale, you start getting even more data & metrics, and the 
need for manual injection of jobs becomes smaller.


I'm curious about people's thoughts on this…


Matthew
[email protected]

_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
  http://lopsa.org/
