FWIW - the Assimilation architecture is quite similar to the "alerting"
portion of your system.
We run all checks locally, and the local systems report on exceptions by
making an outbound connection when needed.
We notice the death of servers in a different way: each server sends
heartbeats to two other servers, so each server has two neighbors that
watch it and are watched by it. When a server's heartbeats go away, we
make an outbound connection to report the death of that server.
As a result, the central server "normally" does nothing.
We don't yet collect time series data.
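For illustration only, the neighbor-heartbeat idea above can be sketched
roughly like this. It is not the actual Assimilation code: the ring
ordering, the 3-second timeout, and the names ring_neighbors and
HeartbeatWatcher are all assumptions made up for the example.

```python
import time


def ring_neighbors(servers, name):
    """Return the two servers that `name` exchanges heartbeats with.

    Assumes the servers are arranged in a ring (an assumption for this sketch).
    """
    i = servers.index(name)
    return servers[i - 1], servers[(i + 1) % len(servers)]


class HeartbeatWatcher:
    """Hypothetical helper: tracks the last heartbeat seen from each neighbor."""

    def __init__(self, neighbors, timeout=3.0, now=time.monotonic):
        self.timeout = timeout  # seconds of silence before a neighbor counts as dead
        self.now = now          # injectable clock, so the logic can be tested
        start = now()
        self.last_seen = {n: start for n in neighbors}

    def heartbeat(self, neighbor):
        # Record that a heartbeat just arrived from this neighbor.
        self.last_seen[neighbor] = self.now()

    def dead_neighbors(self):
        # Neighbors that have been silent for longer than the timeout.
        # Only when this list is non-empty would a server make its
        # outbound connection to report a death.
        t = self.now()
        return sorted(n for n, seen in self.last_seen.items()
                      if t - seen > self.timeout)
```

The point of the sketch is that the watching is spread across the
servers themselves, so nothing has to poll from the center.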
On 1/13/2014 3:16 PM, Adam Compton wrote:
$work uses an interesting paradigm that I didn't really understand
when I got here, but which I now love.
Quick glossary: I'll say "performance monitoring" is collecting data
over time, and optionally making pretty graphs; "alerting" is noting
some state being unsatisfactory and telling people about it.
We don't have our alerting system actually run any commands on remote
servers itself. Rather, if there's something we want to alert on, we
write a script or other component that will run on the machine, and
cause it to report data about that something to the performance
monitoring system. Then, we close the loop by defining our alerting
checks as "tell somebody if this graph goes to zero" (or whatever
conditions you can dream up).
This has several benefits:
* Security is much simpler; we don't have to give any external service
the permissions to run (often root-needing) checks on our servers.
* Networking is similarly much simpler; we don't have to permit
incoming connections from the alerting service.
* Overhead on the alerting hosts is lower, since all they're doing is
(essentially) looking at graphs, rather than running commands which
may block or time out.
* We get detailed performance data about whatever we're monitoring,
which can help guide the alerting thresholds in the future.
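As a rough sketch of the loop described above (a script on the machine
reports a metric, and the alerting check is defined against the
resulting series), something like the following captures the shape.
MetricStore, gone_to_zero, the metric name, and the 300-second window
are hypothetical stand-ins, not the actual system in use at $work.

```python
from collections import defaultdict


class MetricStore:
    """Hypothetical stand-in for the performance monitoring system."""

    def __init__(self):
        self.series = defaultdict(list)

    def report(self, name, timestamp, value):
        # Called by a script running on the monitored host (an outbound push),
        # so the alerting side never has to run commands on that host.
        self.series[name].append((timestamp, value))

    def recent(self, name, since):
        # Values of the named series with a timestamp at or after `since`.
        return [v for t, v in self.series[name] if t >= since]


def gone_to_zero(store, name, now, window=300):
    """'Tell somebody if this graph goes to zero.'

    True when every point in the last `window` seconds is zero. An empty
    window also counts as zero here, on the assumption that a host which
    stopped reporting is worth an alert too.
    """
    values = store.recent(name, now - window)
    return all(v == 0 for v in values)
```

For example, after store.report("web.requests", 400, 0), a check such as
gone_to_zero(store, "web.requests", now=450) is what would page someone.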
I was very resistant to this idea when I first arrived, since it was
unlike anything I'd ever seen before. There are still some downsides,
particularly added latency in realtime alerting (since metrics have to
go through the performance monitoring system first), but I think the
upsides outweigh them by a large margin. Riemann seems to be a similar
concept, although in our case we have our performance monitoring
system sitting in the middle where Riemann places itself.
- Adam Compton
On 1/13/14 8:04 AM, Matthew Barr wrote:
So, I’ve recently been reading up on the #monitoringsucks tags, their
responses, and some of the various things that have come out of it.
I’m in a new shop, AWS based, so many of the old standbys aren’t quite
as much of an obvious call anymore.
What I’m now trying to figure out is what I’m missing, or would lose,
by going with a newer paradigm for monitoring.
Anyone using Riemann yet? Do you still use Nagios / Sensu / etc.?
Basically, Riemann operates on a stream of metrics, vs. relying on
a check every X min.
I’m trying to determine what I’ve lost by not implementing a
Nagios-style system to basically cron checks. (The alerting & state
stuff I’m pretty confident I’m not losing.)
For example: I had initially thought I’d lose a check of the web site
every X min, but the load balancer does that anyway, and that
triggers logs and metrics about how fast pages return.
I think that as you scale, you start getting even more data &
metrics, and the need for manual injection of jobs becomes smaller.
I’m curious about people’s thoughts on this…
Matthew
_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/