Hi,

I'm kicking around similar thoughts right now.

Metrics and log-stream analysis go a long way, but you still need to
monitor the system as a whole.  For me, monitoring is moving toward
automated functional tests (hit an endpoint, query data, get the
expected result, etc.) and away from "is the webserver up."  But some
critical components in the pipeline will still need dedicated
monitoring, to make fault isolation easier and to answer questions
like "if the query failed, is that because the message bus is busted,
the backend service is funked up, the database is batty, or my data is
just plain fried?"  In other words: alert on functional failures, but
have the system give me the right red lights to point me in the right
direction at response time.
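
To make that concrete, here's roughly the shape of check I mean, as a
minimal Python sketch; every URL, endpoint, and component name below
is made up for illustration:

#!/usr/bin/env python3
"""Functional check that exercises the whole pipeline, plus
per-component "red lights" so a failure points at the broken layer."""
import requests

# Per-component health endpoints (hypothetical), ordered roughly by
# pipeline dependency.
CHECKS = [
    ("webserver",   "https://app.example.com/healthz"),
    ("message-bus", "https://app.example.com/healthz/bus"),
    ("database",    "https://app.example.com/healthz/db"),
]

def functional_check():
    """End-to-end test: query real data and verify the expected result."""
    try:
        r = requests.get("https://app.example.com/api/orders/12345",
                         timeout=5)
        return r.status_code == 200 and r.json().get("order_id") == 12345
    except (requests.RequestException, ValueError):
        return False

def component_lights():
    """Per-component health, so a functional failure is easy to isolate."""
    lights = {}
    for name, url in CHECKS:
        try:
            lights[name] = requests.get(url, timeout=5).ok
        except requests.RequestException:
            lights[name] = False
    return lights

if __name__ == "__main__":
    if not functional_check():
        # Alert on the functional failure, but attach the red lights
        # so the responder knows where to start looking.
        print("FUNCTIONAL CHECK FAILED", component_lights())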

Also, handling dependencies for those event and metric alerts in a way
that avoids alert fatigue is still something I'm trying to figure out.
Most (all?) of the metrics-type packages have no notion of "if the
core switch is unpingable, don't alert me about the 57 web servers
behind it."  Flapjack might help here, but it's total overkill for a
lot of setups.
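
What I want is something like the toy Python sketch below; the
topology map is hypothetical and hand-maintained, which is of course
part of the problem:

# child host -> the parent it's unreachable without (made-up topology)
DEPENDS_ON = {f"web{i:02d}": "core-switch-1" for i in range(1, 58)}

def suppressed(host, down_hosts):
    """True if an alert for `host` should be squelched because a host
    it depends on is already known down."""
    parent = DEPENDS_ON.get(host)
    return parent is not None and parent in down_hosts

def alerts_to_send(down_hosts):
    """One alert for the switch, not 57 more for the web servers
    behind it."""
    return [h for h in down_hosts if not suppressed(h, down_hosts)]

# Core switch plus all 57 web servers unpingable -> exactly one alert.
down = {"core-switch-1"} | set(DEPENDS_ON)
assert alerts_to_send(down) == ["core-switch-1"]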

As far as I can tell, the ideal solution (for me) is to use
nagios/sensu/whatever to aggregate functional tests, standard
healthcheck monitoring, and metric threshold analysis into a
comprehensive "view" of system state, and then alert on that.
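
Concretely, I'm picturing a worst-state rollup fed by all three
sources, something like this rough Python sketch (check names and
states are invented):

def rollup(results):
    """results: {check_name: "ok" | "warn" | "crit"} from any source
    (nagios/sensu checks, metric thresholds, functional tests).
    Returns the worst state, nagios-style."""
    order = {"ok": 0, "warn": 1, "crit": 2}
    return max(results.values(), key=order.__getitem__, default="ok")

checks = {
    "functional:checkout-flow":  "crit",  # end-to-end test
    "health:web01-http":         "ok",    # standard healthcheck
    "metric:db-replication-lag": "warn",  # threshold on a metric
}

if rollup(checks) == "crit":
    print("PAGE: system state critical", checks)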

NOTE: I don't claim this is right or sane; I'm interested in the ensuing discussion.

Thanks!

-n

On Mon, Jan 13, 2014 at 9:04 AM, Matthew Barr <[email protected]> wrote:
> So, I've recently been reading up on the #monitoringsucks tag, the
> responses, and some of the various things that have come out of it.
> I'm in a new shop, AWS based, so many of the old standbys aren't
> quite as obvious a call anymore.
>
> What I’m now trying to figure out is what I’m missing, or would lose, by 
> going with a newer paradigm for monitoring.
>
>
> Anyone using Riemann yet?   Do you still use nagios / sensu / etc?
>
>  — Basically, Riemann operates on a stream of metrics, vs. relying
> on a check every X min.
>
> I'm trying to determine what I've lost by not implementing a
> nagios-style system that basically crons checks.   (The alerting &
> state stuff I'm pretty confident I'm not losing.)
>
>
> For example: I had initially thought I'd lose a check of the web
> site every X min, but the load balancer does that anyway, and it
> generates logs and metrics about page response times.
>
> I think that as you scale, you get even more data & metrics, and the
> need for manually injected check jobs shrinks.
>
>
> I'm curious about people's thoughts on this…
>
>
> Matthew
> [email protected]
>



-- 
-------------------------------------------
nathan hruby <[email protected]>
metaphysically wrinkle-free
-------------------------------------------
_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/
