Hello,

I would like to bump this post, as I am also curious how others have deployed monitoring in dynamic cloud environments. Thank you, Morgan, for outlining a problem that many of us share.

- Ash

On 24.03.2013 03:08, Morgan Blackthorne wrote:
This is a spin-off question related to the other monitoring system
thread we have going, taking it from the general direction towards a
specific use-case scenario.

I've used several systems throughout the years, the last two notably
being Nagios and Zabbix. Nagios seems better suited for monitoring,
while Zabbix is clearly superior in terms of graphing. Configuring
NagiosGraph is... more difficult than it should be, IMO. The Zabbix
agent seemed to be less reliable than NRPE, however, and last I worked
with Zabbix it seemed to default to not alerting unless explicitly
configured to do so. (It's been a while since we moved away from it,
so my memory is a bit foggy. Near as I can recall, a configured alarm
via a Zabbix agent check would not fire if the agent itself was not
reachable, and the system did not natively support the concept of a
"host down" alert in that situation, either. You had to manually
configure a check of the network interfaces and the agent itself,
which seemed very counter-intuitive, and led to many situations where
we hadn't properly thought through all failure scenarios to configure
the alarms explicitly enough. All that said, I know some of the issues
we had and raised with Zabbix were marked as pending the 2.x branch,
which is out now-- I'm not sure if they've been resolved or if the
framework to resolve them is now in place.)

However, I'm specifically curious to see what people are using for
environments where the hosts can be spun up and down outside the
control of the normal provisioning channels. I know that there's been
significant work done lately by the Opscode folks to configure Nagios
dynamically via Chef, which is something I've got on my to-research
list when I get beyond the ops programming tasks on my plate right
now. I believe the downside of that would be whatever the interval is
between a node being terminated and the configuration being
regenerated. I know that Zabbix also supports the idea of dynamic node
registration, which seems very applicable in this case, but again, I'm
not sure if it's got some kind of pruning capability in place.
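
To make the pruning concern concrete, here is a minimal sketch of the
reconciliation step any of these approaches would need: compare the
hosts the monitoring system knows about with the instances that are
actually running, then derive what to register and what to prune. The
function name and the instance IDs are hypothetical, and how each list
is obtained (Chef search, Zabbix API, EC2 API) is deliberately left
out.

```python
def reconcile(monitored, running):
    """Compare monitored hosts against running instances.

    Both arguments are iterables of host identifiers (for example EC2
    instance IDs). Returns a tuple (to_add, to_prune) of sorted lists.
    """
    monitored = set(monitored)
    running = set(running)
    to_add = sorted(running - monitored)    # running but not yet monitored
    to_prune = sorted(monitored - running)  # monitored but already terminated
    return to_add, to_prune


if __name__ == "__main__":
    # Hypothetical data: i-aaa was terminated, i-ccc was just launched.
    add, prune = reconcile(["i-aaa", "i-bbb"], ["i-bbb", "i-ccc"])
    print("register:", add)
    print("prune:", prune)
```

The window during which a terminated node can still raise false alarms
is roughly however often this comparison runs, plus the time needed to
reload the monitoring configuration.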

I'm also curious to know along these lines if anyone has worked with
a system (either native or with a connector) that will take advantage
of Amazon's CloudWatch metrics. I could certainly monitor things like
CPU and network utilization myself, but if AWS is already doing so,
polling their data seems like it would be easier. (Potentially
cleaner? I'm undecided on that, since it seems like it could introduce
another dependency-- yet I've never seen CloudWatch unavailable when
the core EC2 services were working. However, I may not have explored
it in enough detail to see that kind of failure, so... I remain
undecided.) One of the upsides of integrating with CloudWatch is that
I can monitor the same metrics that autoscaling is operating on, and I
believe actually retrieve those thresholds as well, rather than
needing to configure them by hand (or by role in Chef, but that would
still need to be manually updated if I changed the autoscaling
parameters).
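
As a rough illustration of the CloudWatch idea, the sketch below reads
the most recent average for one instance metric and lists the
thresholds of existing alarms on that metric. It assumes a client
object shaped like boto3's CloudWatch client (boto3.client("cloudwatch"))
with its get_metric_statistics and describe_alarms calls; the helper
names, the 10-minute window, and the 300-second period are illustrative
choices, not anything from the setup described above. Pagination of
describe_alarms is ignored for brevity.

```python
from datetime import datetime, timedelta, timezone


def latest_average(cloudwatch, instance_id, metric="CPUUtilization",
                   minutes=10, period=300):
    """Return the newest Average datapoint for an instance, or None."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName=metric,
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=start,
        EndTime=end,
        Period=period,
        Statistics=["Average"],
    )
    # Datapoints are not guaranteed to arrive in time order.
    points = sorted(resp.get("Datapoints", []), key=lambda p: p["Timestamp"])
    # No datapoints is reported as None so the caller can alert on it
    # instead of silently treating missing data as healthy.
    return points[-1]["Average"] if points else None


def alarm_thresholds(cloudwatch, metric="CPUUtilization"):
    """Map alarm name to (comparison operator, threshold) for one metric."""
    resp = cloudwatch.describe_alarms()
    return {
        alarm["AlarmName"]: (alarm["ComparisonOperator"], alarm["Threshold"])
        for alarm in resp.get("MetricAlarms", [])
        if alarm.get("MetricName") == metric
    }
```

Passing the client in as an argument keeps the AWS dependency at the
edge, so the monitoring side can decide what to do when CloudWatch
returns nothing.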

Thanks for any thoughts. :)

--
~*~ StormeRider ~*~

"Every world needs its heroes [...] They inspire us to be better than
we are. And they protect from the darkness that's just around the
corner."

(from Smallville Season 6x1: "Zod")

On why I hate the phrase "that's so lame"... http://bit.ly/Ps3uSS [1]

Links:
------
[1] http://bit.ly/Ps3uSS

_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/
