Re: [lopsa-discuss] Metrics vs Monitoring...

Alan Robertson Mon, 13 Jan 2014 14:53:05 -0800

FWIW - the Assimilation architecture is quite similar to the "alerting" portion of your system.

We run all checks locally, and the local systems report on exceptions by making an outbound connection when needed.

We notice death of servers in a different way - each server sends heartbeats to two other servers - so each server has two neighbors that are watching it, and are watched by it. When heartbeats go away, we make an outbound connection to report death of server.
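A minimal sketch of the neighbor-heartbeat idea, not the Assimilation implementation: it assumes the servers are arranged in a ring so that each one exchanges heartbeats with the server before and after it. The names `ring_neighbors` and `silent_neighbors` and the 3-second timeout are hypothetical.

```python
from typing import Dict, List, Tuple


def ring_neighbors(servers: List[str]) -> Dict[str, Tuple[str, str]]:
    """Map each server to its (previous, next) neighbor on a ring (assumed layout)."""
    n = len(servers)
    if n < 3:
        raise ValueError("need at least three servers for two distinct neighbors")
    return {
        s: (servers[(i - 1) % n], servers[(i + 1) % n])
        for i, s in enumerate(servers)
    }


def silent_neighbors(
    neighbors: Tuple[str, str],
    last_heard: Dict[str, float],
    now: float,
    timeout: float = 3.0,  # hypothetical value, in seconds
) -> List[str]:
    """Return the neighbors whose last heartbeat is older than `timeout`."""
    return [
        peer for peer in neighbors
        # A peer we have never heard from counts as silent.
        if now - last_heard.get(peer, float("-inf")) > timeout
    ]


ring = ring_neighbors(["a", "b", "c", "d"])
# Server "b" watches "a" and "c"; only "c" has been quiet for too long.
dead = silent_neighbors(ring["b"], {"a": 10.0, "c": 4.0}, now=10.5)
```

Only when `dead` is non-empty would a server open an outbound connection to report, which is why the central server normally has nothing to do.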


As a result, the central server "normally" does nothing.

We don't yet collect time series data.

On 1/13/2014 3:16 PM, Adam Compton wrote:

$work uses an interesting paradigm that I didn't really understand when I got here, but which I now love.
Quick glossary: I'll say "performance monitoring" is collecting data over time, and optionally making pretty graphs; "alerting" is noting some state being unsatisfactory and telling people about it.
We don't have our alerting system actually run any commands on remote servers itself. Rather, if there's something we want to alert on, we write a script or other component that will run on the machine, and cause it to report data about that something to the performance monitoring system. Then, we close the loop by defining our alerting checks as "tell somebody if this graph goes to zero" (or whatever conditions you can dream up).
This has several benefits:
* Security is much simpler; we don't have to give any external service the permissions to run (often root-needing) checks on our servers.
* Networking is similarly much simpler; we don't have to permit incoming connections from the alerting service.
* Overhead on the alerting hosts is lower, since all they're doing is (essentially) looking at graphs, rather than running commands which may block or time out.
* We get detailed performance data about whatever we're monitoring, which can help guide the alerting thresholds in the future.
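The loop described above (hosts push data, alerting only reads it) could be sketched roughly as follows. `MetricStore` is a hypothetical in-memory stand-in for the performance monitoring system, and treating "no data in the window" as an alert is an assumption of this sketch, not something stated in the thread.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

Point = Tuple[float, float]  # (unix timestamp, value)


class MetricStore:
    """Hypothetical stand-in for the performance monitoring system."""

    def __init__(self) -> None:
        self._series: DefaultDict[str, List[Point]] = defaultdict(list)

    def report(self, name: str, timestamp: float, value: float) -> None:
        # Called from the monitored host: an outbound push, no inbound access.
        self._series[name].append((timestamp, value))

    def recent(self, name: str, now: float, window: float) -> List[Point]:
        # Points whose timestamp falls within the last `window` seconds.
        return [(t, v) for t, v in self._series[name] if now - window <= t <= now]


def graph_went_to_zero(
    store: MetricStore, name: str, now: float, window: float = 300.0
) -> bool:
    """Alert condition: no data in the window, or every recent value is zero."""
    points = store.recent(name, now, window)
    return not points or all(v == 0 for _, v in points)


store = MetricStore()
store.report("web.requests", 100.0, 12.0)
store.report("web.requests", 160.0, 0.0)
store.report("web.requests", 220.0, 0.0)
alert = graph_went_to_zero(store, "web.requests", now=230.0, window=100.0)
```

The alerting side never touches the monitored host; it only reads what the host already reported, which is where both the security benefit and the extra latency come from.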
I was very resistant to this idea when I first arrived, since it was unlike anything I'd ever seen before. There are still some downsides, particularly a latency around realtime alerting (since metrics have to go through the performance monitoring system first), but I think the upsides outweigh them by a large margin. Riemann seems to be a similar concept, although in our case we have our performance monitoring system sitting in the middle where Riemann places itself.
- Adam Compton



On 1/13/14 8:04 AM, Matthew Barr wrote:
So, I’ve recently been reading up on the #monitoringsucks tags, their responses, and some of the various things that have come out of it. I’m in a new shop, AWS based, so many of the old standbys aren’t quite as much of an obvious call anymore.
What I’m now trying to figure out is what I’m missing, or would lose, by going with a newer paradigm for monitoring.
Anyone using Riemann yet? Do you still use nagios / sensu / etc?
Basically, Riemann operates on a stream of metrics, vs relying on a check every X min.
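To make the contrast concrete, here is a rough sketch of stream-style evaluation in plain Python. It is not Riemann's API; the `state_changes` function, the event shape, and the threshold are all hypothetical. Each event is evaluated as it arrives, and something is emitted only when a service changes state, instead of running a scheduled check every X minutes.

```python
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, float]  # (service name, metric value)


def state_changes(
    events: Iterable[Event], threshold: float
) -> Iterator[Tuple[str, str]]:
    """Yield (service, state) whenever a service's state differs from its last one."""
    last_state: Dict[str, str] = {}
    for service, value in events:
        state = "critical" if value > threshold else "ok"
        # The first event seen for a service also counts as a change.
        if last_state.get(service) != state:
            last_state[service] = state
            yield service, state


events = [("web", 0.2), ("web", 0.9), ("web", 0.95), ("web", 0.1)]
changes = list(state_changes(events, threshold=0.5))
# changes == [("web", "ok"), ("web", "critical"), ("web", "ok")]
```

A cron-style check would only see whatever value happens to be current when it runs; the stream version reacts to every reported value, at the cost of needing the metrics to be flowing in the first place.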
I’m trying to determine what I’ve lost by not implementing a nagios style system, to basically cron checks. (The alerting & state stuff I’m pretty confident I’m not losing.)
For example: I had initially thought I’d lose a check of the web site every X min, but the load balancer does that anyway, and that triggers logs and metrics about page response speed.
I think that as you scale, you start getting even more data & metrics, and the need for manual injection of jobs becomes smaller.
I’m curious about people’s thoughts on this…


Matthew
[email protected]

_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
  http://lopsa.org/