Hi Morgan,

2013/3/24 Morgan Blackthorne <[email protected]>
> This is a spin-off question related to the other monitoring system thread
> we have going, taking it from the general direction towards a specific
> use-case scenario.
>
> I've used several systems throughout the years, the last two notably being
> Nagios and Zabbix. Nagios seems better suited for monitoring, while Zabbix
> is clearly superior in terms of graphing. Configuring NagiosGraph is...
> more difficult than it should be, IMO. The Zabbix agent seemed to be less
> reliable than NRPE, however, and last I worked with Zabbix it seemed to
> default to not alerting unless explicitly configured to do so. (It's been a
> while since we moved away from it, so my memory is a bit foggy. Near as I
> can recall, a configured alarm via a Zabbix agent check would not fire if
> the agent itself was not reachable, and the system did not natively support
> the concept of a "host down" alert in that situation, either. You had to
> manually configure a check of the network interfaces and the agent itself,
> which seemed very counter-intuitive, and led to many situations where we
> hadn't properly thought through all failure scenarios to configure the
> alarms explicitly enough. All that said, I know some of the issues we had
> and raised with Zabbix were marked as pending the 2.x branch, which is out
> now-- I'm not sure if they've been resolved or if the framework to resolve
> them is now in place.)
>
> However, I'm specifically curious to see what people are using for
> environments where the hosts can be spun up and down outside the control of
> the normal provisioning channels. I know that there's been significant work
> done lately by the Opscode folks to configure Nagios dynamically via Chef,
> which is something I've got on my to-research list when I get beyond the
> ops programming tasks on my plate right now. I believe the downside of that
> would be whatever the interval is between a node being terminated and the
> configuration being regenerated.
> I know that Zabbix also supports the idea
> of dynamic node registration, which seems very applicable in this case, but
> again, I'm not sure if it's got some kind of pruning capability in place.
>
> I'm also curious to know along these lines if anyone has worked with a
> system (either native or with a connector) that will take advantage of
> Amazon's CloudWatch metrics. I could certainly monitor things like CPU and
> network utilization myself, but if AWS is already doing so, polling their
> data seems like it would be easier. (Potentially cleaner? I'm undecided on
> that, since it seems like it could introduce another dependency-- yet I've
> never seen CloudWatch unavailable when the core EC2 services were working.
> However, I may not have explored it in enough detail to see that kind of
> failure, so... I remain undecided.) One of the upsides of integrating with
> CloudWatch is that I can monitor the same metrics that autoscaling is
> operating on, and I believe actually retrieve those thresholds as well,
> rather than needing to configure them by hand (or by role in Chef, but that
> would still need to be manually updated if I changed the autoscaling
> parameters).
>

I haven't used the new AWS-internal stuff; just please take note that the Nagios world has undergone major changes in the years since 2009. Stuff like NagiosGraph is heavily dated, and NRPE is almost completely obsolete in new setups. Configuration would be rule-based, with inheritance of rules etc. Graphing would work like this:

Does the check return performance data?
  => OK, let's paint a "PNP4Nagios" graph for it.
No template defining the colours?
  => Well, let's just use the default template.
And the GUI just autodetects whether there's a valid graph and, if so, displays it.

Stuff can be easy.
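The graphing decision described above can be sketched roughly as follows. This is a minimal Python illustration, not PNP4Nagios's actual code: it assumes the standard Nagios plugin output format, where performance data follows a "|" as space-separated 'label'=value[UOM];[warn];[crit];[min];[max] items. The TEMPLATES mapping, the template names and the parse_perfdata()/graph_for() helpers are hypothetical.

```python
import re
from typing import Dict, List, Optional

# One perfdata item: 'label'=value[UOM];[warn];[crit];[min];[max]
# Simplified: quoted labels may contain spaces, but escaped quotes ('')
# and the "U" (unknown) value are not handled.
PERF_ITEM = re.compile(r"('[^']+'|[^\s=]+)=(-?[\d.]+)([^;\s]*)((?:;[^;\s]*){0,4})")

# Hypothetical per-check colour templates; the names are made up.
TEMPLATES = {"check_ping": "ping-template"}
DEFAULT_TEMPLATE = "default"


def parse_perfdata(plugin_output: str) -> List[Dict[str, object]]:
    """Return the perfdata items found after the first '|' (empty if none)."""
    if "|" not in plugin_output:
        return []
    perf = plugin_output.split("|", 1)[1]
    # The fourth group (warn;crit;min;max) is matched but ignored here.
    return [
        {"label": label.strip("'"), "value": float(value), "uom": uom}
        for label, value, uom, _thresholds in PERF_ITEM.findall(perf)
    ]


def graph_for(check_name: str, plugin_output: str) -> Optional[str]:
    """Pick a graph template for a check result, or None if there is nothing to graph."""
    if not parse_perfdata(plugin_output):
        return None  # no performance data -> no graph
    # Use the template for this check if one exists, otherwise the default one.
    return TEMPLATES.get(check_name, DEFAULT_TEMPLATE)


if __name__ == "__main__":
    ping = ("PING OK - Packet loss = 0%, RTA = 0.52 ms"
            "|rta=0.520000ms;100.000000;500.000000;0.000000 pl=0%;20;60;0")
    print(parse_perfdata(ping))
    print(graph_for("check_ping", ping))                          # ping-template
    print(graph_for("check_disk", "DISK OK|'/ used'=42%;80;90"))  # default
    print(graph_for("check_ssh", "SSH OK - OpenSSH"))             # None
```

A check that prints no performance data simply gets no graph, and a check without its own template falls back to the default one, so nothing has to be configured per check for graphs to show up.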
My old employer, Mathias Kettner GmbH, was responsible for a large share of those changes. It also kicked off a project called OMD, which offers a nice installer bundle so you stop wasting time on installing Nagios itself. That package also comes with alternative cores like "Icinga" or "Shinken", which some (some) consider much improved over Nagios.

None of this stuff can be claimed to be aimed at cloud monitoring specifically. Just, please, should you check out Nagios again, don't settle for doing things the way they were done some years ago.

Greets,
Florian
_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/
