[lopsa-discuss] Monitoring systems for cloud nodes

Morgan Blackthorne Sun, 24 Mar 2013 00:09:12 -0700

This is a spin-off question related to the other monitoring system thread
we have going, taking it from the general direction towards a specific
use-case scenario.


I've used several systems throughout the years, the last two notably being
Nagios and Zabbix. Nagios seems better suited for monitoring, while Zabbix
is clearly superior in terms of graphing. Configuring NagiosGraph is...
more difficult than it should be, IMO. The Zabbix agent seemed to be less
reliable than NRPE, however, and last I worked with Zabbix it seemed to
default to not alerting unless explicitly configured to do so. (It's been a
while since we moved away from it, so my memory is a bit foggy. Near as I
can recall, a configured alarm via a Zabbix agent check would not fire if
the agent itself was not reachable, and the system did not natively support
the concept of a "host down" alert in that situation, either. You had to
manually configure a check of the network interfaces and the agent itself,
which seemed very counter-intuitive, and led to many situations where we
hadn't properly thought through all failure scenarios to configure the
alarms explicitly enough. All that said, I know some of the issues we had
and raised with Zabbix were marked as pending the 2.x branch, which is out
now-- I'm not sure if they've been resolved or if the framework to resolve
them is now in place.)
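
As a rough illustration of the failure mode described above (this is not
Zabbix's or Nagios's actual logic), a threshold check can treat "no recent
data from the agent" as its own state instead of staying silent. The names
CheckResult, evaluate, and max_age are hypothetical, chosen only for this
sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    value: Optional[float]        # last value the agent reported, None if never seen
    received_at: Optional[float]  # Unix timestamp of that report, None if never seen


def evaluate(result: CheckResult, threshold: float, now: float,
             max_age: float = 180.0) -> str:
    """Return "OK", "ALERT", or "UNREACHABLE" for a single agent check.

    Missing or stale data is reported explicitly rather than being treated
    as "nothing to alert on" (max_age is an assumed staleness limit).
    """
    if result.value is None or result.received_at is None:
        return "UNREACHABLE"
    if now - result.received_at > max_age:
        return "UNREACHABLE"
    return "ALERT" if result.value > threshold else "OK"
```

With this shape, an unreachable agent produces a distinct state that can be
routed to a "host down" notification without a separately configured check.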

However, I'm specifically curious to see what people are using for
environments where the hosts can be spun up and down outside the control of
the normal provisioning channels. I know that there's been significant work
done lately by the Opscode folks to configure Nagios dynamically via Chef,
which is something I've got on my to-research list when I get beyond the
ops programming tasks on my plate right now. I believe the downside of that
would be the interval between a node being terminated and the configuration
being regenerated, during which Nagios would presumably keep alerting on a
host that no longer exists. I know that Zabbix also supports the idea
of dynamic node registration, which seems very applicable in this case, but
again, I'm not sure if it's got some kind of pruning capability in place.
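
To make the pruning question concrete, here is a minimal sketch of the
reconciliation step, assuming the monitoring system can list its registered
hosts and the cloud API can list running instances. The function name
hosts_to_prune, the last_seen map, and the grace period are all hypothetical;
neither Nagios nor Zabbix is claimed to work this way.

```python
def hosts_to_prune(monitored, running, last_seen, now, grace_seconds=300.0):
    """Return the sorted host IDs that should be removed from monitoring.

    monitored:     iterable of host IDs currently registered for monitoring
    running:       iterable of host IDs the cloud API reports as running
    last_seen:     dict mapping host ID -> Unix timestamp when the host was
                   last observed running
    grace_seconds: how long a missing host is kept before pruning, so a
                   short API hiccup does not drop a live host
    """
    running = set(running)
    stale = []
    for host in sorted(set(monitored) - running):
        seen = last_seen.get(host)
        # Prune hosts that were never observed running or have been gone
        # for at least the grace period.
        if seen is None or now - seen >= grace_seconds:
            stale.append(host)
    return stale
```

Running this on each configuration pass would bound the stale-host window by
the pass interval plus the grace period.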

I'm also curious to know along these lines if anyone has worked with a
system (either native or with a connector) that will take advantage of
Amazon's CloudWatch metrics. I could certainly monitor things like CPU and
network utilization myself, but if AWS is already doing so, polling their
data seems like it would be easier. (Potentially cleaner? I'm undecided on
that, since it seems like it could introduce another dependency-- yet I've
never seen CloudWatch unavailable when the core EC2 services were working.
However, I may not have explored it in enough detail to see that kind of
failure, so... I remain undecided.) One of the upsides of integrating with
CloudWatch is that I can monitor the same metrics that autoscaling is
operating on, and I believe actually retrieve those thresholds as well,
rather than needing to configure them by hand (or by role in Chef, but that
would still need to be manually updated if I changed the autoscaling
parameters).
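
For the threshold-retrieval idea, a sketch along these lines might work. It
assumes a client object exposing a describe_alarms call that returns pages
shaped like the CloudWatch DescribeAlarms response (a "MetricAlarms" list and
an optional "NextToken"), as boto3's CloudWatch client does. boto3 postdates
this thread, and the function name alarm_thresholds and the group name in the
usage note are hypothetical.

```python
def alarm_thresholds(cloudwatch, group_name):
    """Collect alarm thresholds attached to one Auto Scaling group.

    Returns a dict mapping alarm name -> (metric name, comparison operator,
    threshold), so a monitoring config can reuse the values autoscaling
    acts on instead of duplicating them by hand.
    """
    thresholds = {}
    kwargs = {}
    while True:
        page = cloudwatch.describe_alarms(**kwargs)
        for alarm in page.get("MetricAlarms", []):
            dims = {d["Name"]: d["Value"] for d in alarm.get("Dimensions", [])}
            # Keep only alarms scoped to the requested Auto Scaling group.
            if dims.get("AutoScalingGroupName") == group_name:
                thresholds[alarm["AlarmName"]] = (
                    alarm["MetricName"],
                    alarm["ComparisonOperator"],
                    alarm["Threshold"],
                )
        token = page.get("NextToken")
        if not token:
            return thresholds
        kwargs = {"NextToken": token}
```

With boto3 installed and credentials configured, usage would look roughly
like alarm_thresholds(boto3.client("cloudwatch"), "web-asg"), where "web-asg"
is a placeholder group name.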

Thanks for any thoughts. :)

--
~*~ StormeRider ~*~

"Every world needs its heroes [...] They inspire us to be better than we
are. And they protect from the darkness that's just around the corner."

(from Smallville Season 6x1: "Zod")

On why I hate the phrase "that's so lame"... http://bit.ly/Ps3uSS

_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/
