$work uses an interesting paradigm that I didn't really understand when I got here, but which I now love.

Quick glossary: I'll say "performance monitoring" is collecting data over time, and optionally making pretty graphs; "alerting" is noting some state being unsatisfactory and telling people about it.

We don't have our alerting system actually run any commands on remote servers itself. Rather, if there's something we want to alert on, we write a script or other component that runs on the machine and reports data about it to the performance monitoring system. Then we close the loop by defining our alerting checks as "tell somebody if this graph goes to zero" (or whatever conditions you can dream up).
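To make that concrete, here's a minimal sketch of the reporting half, assuming a Graphite-style plaintext listener; the hostname, port, metric name, and "myworker" process name are placeholders rather than our real setup:

    #!/usr/bin/env python3
    # Local check that reports a metric instead of alerting directly.
    # Assumes a Graphite-style plaintext listener; host, port, metric
    # name, and the "myworker" process name are all hypothetical.
    import socket
    import subprocess
    import time

    GRAPHITE_HOST = "graphite.example.com"  # placeholder collector
    GRAPHITE_PORT = 2003                    # Graphite plaintext port

    def report_metric(name, value):
        """Send one 'name value timestamp' line to the collector."""
        line = "%s %s %d\n" % (name, value, int(time.time()))
        with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT),
                                      timeout=5) as sock:
            sock.sendall(line.encode("ascii"))

    if __name__ == "__main__":
        # Count running worker processes; the alerting side fires if
        # this series ever drops to zero.
        out = subprocess.run(["pgrep", "-c", "myworker"],
                             capture_output=True, text=True).stdout
        report_metric("servers.web01.myworker.process_count", int(out or 0))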

This has several benefits:
* Security is much simpler; we don't have to give any external service the permissions to run (often root-requiring) checks on our servers.
* Networking is similarly much simpler; we don't have to permit incoming connections from the alerting service.
* Overhead on the alerting hosts is lower, since all they're doing is (essentially) looking at graphs, rather than running commands which may block or time out.
* We get detailed performance data about whatever we're monitoring, which can help guide the alerting thresholds in the future.
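That last point is also why the alerting side stays trivial. Here's a sketch of a check that only ever looks at the graph, polling a (hypothetical) Graphite render endpoint and firing if the series goes to zero or stops reporting; the URL, target, and print-as-pager stand-in are assumptions, not our actual setup:

    #!/usr/bin/env python3
    # Alerting-side sketch: watch the graph, not the host. Assumes a
    # Graphite render API; the URL and target are placeholders, and
    # print() stands in for whatever actually pages a human.
    import json
    import urllib.request

    RENDER_URL = ("http://graphite.example.com/render"
                  "?target=servers.web01.myworker.process_count"
                  "&from=-10min&format=json")

    def latest_value(url):
        """Return the most recent non-null datapoint, or None."""
        with urllib.request.urlopen(url, timeout=10) as resp:
            series = json.load(resp)
        if not series:
            return None
        points = [v for v, _ts in series[0]["datapoints"] if v is not None]
        return points[-1] if points else None

    value = latest_value(RENDER_URL)
    if value is None or value == 0:
        # "Tell somebody if this graph goes to zero."
        print("ALERT: process_count is %r" % value)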

I was very resistant to this idea when I first arrived, since it was unlike anything I'd ever seen before. There are still some downsides, particularly added latency for real-time alerting (since metrics have to flow through the performance monitoring system first), but I think the upsides outweigh them by a large margin. Riemann seems to be a similar concept, although in our case we have our performance monitoring system sitting in the middle where Riemann places itself.

- Adam Compton



On 1/13/14 8:04 AM, Matthew Barr wrote:
So, I've recently been reading up on the #monitoringsucks tag, the
responses, and some of the various things that have come out of it.
I'm in a new shop, AWS-based, so many of the old standbys aren't quite
as obvious a call anymore.

What I’m now trying to figure out is what I’m missing, or would lose, by going 
with a newer paradigm for monitoring.


Anyone using Riemann yet? Do you still use Nagios / Sensu / etc.?

- Basically, Riemann operates on a stream of metrics, vs. relying on a
check every X minutes.

I'm trying to determine what I've lost by not implementing a Nagios-style
system that basically crons checks. (The alerting & state stuff I'm pretty
confident I'm not losing.)


For example: I had initially thought I'd lose a check of the web site every
X minutes, but the load balancer does that anyway, and that generates logs
and metrics about page response times.

I think that as you scale, you start getting even more data & metrics, and the 
need for manual injection of jobs becomes smaller.


I'm curious about people's thoughts on this…


Matthew
[email protected]

_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
  http://lopsa.org/
