Re: [lopsa-discuss] Monitoring systems for cloud nodes

Morgan Blackthorne Mon, 25 Mar 2013 08:00:41 -0700

Very interesting! We're currently using 3.2.3 with NagiosGraph on our
legacy CentOS system. What version has these kinds of changes in it? I want
to make sure I'm running the right version when I set up the Chef test
config.


Also, if NRPE is no longer in use, what has replaced it?

I've heard of Icinga before, but not Shinken; I'll have to take a look at
both. One of the problems with the historical Nagios GUI is that you can't
do things like bulk-acknowledge alerts; you have to go through and ack them
one by one. I believe Icinga allows you to do that. I had read that Nagios
was moving towards a PHP-based front end that would let them iterate more
quickly on UI improvements, but I'm not sure whether that has actually
taken place.
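
For what it's worth, one workaround for the one-by-one acks is to script them
through the Nagios external command interface rather than the GUI. Below is a
minimal sketch in Python, assuming external commands are enabled; the host and
service names, author, and command file path are all placeholders I made up,
and the path in particular varies per install.

```python
import time


def build_ack_commands(problems, author, comment, now=None):
    """Return one ACKNOWLEDGE_SVC_PROBLEM line per (host, service) pair.

    `problems` is an iterable of (host_name, service_description) tuples.
    """
    ts = int(time.time() if now is None else now)
    lines = []
    for host, service in problems:
        # Fields after the command name:
        # host;service;sticky;notify;persistent;author;comment
        # sticky=2 keeps the ack until the service recovers.
        lines.append(
            f"[{ts}] ACKNOWLEDGE_SVC_PROBLEM;{host};{service};2;1;0;{author};{comment}"
        )
    return lines


def send_commands(lines, command_file):
    """Write the command lines to the external command pipe.

    `command_file` is whatever `command_file` points to in nagios.cfg;
    no default is assumed here because it differs between installs.
    """
    with open(command_file, "w") as pipe:
        pipe.write("\n".join(lines) + "\n")
```

Usage would be something like building the list with
`build_ack_commands([("web1", "HTTP"), ("db1", "Load")], "admin", "known issue")`
and passing the result to `send_commands` along with your own command file
path. Host-level problems would need ACKNOWLEDGE_HOST_PROBLEM instead, which
takes no service field.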

--
~*~ StormeRider ~*~

"Every world needs its heroes [...] They inspire us to be better than we
are. And they protect from the darkness that's just around the corner."

(from Smallville Season 6x1: "Zod")

On why I hate the phrase "that's so lame"... http://bit.ly/Ps3uSS


On Sun, Mar 24, 2013 at 11:30 AM, Florian Heigl <[email protected]> wrote:

> Hi Morgan,
>
> 2013/3/24 Morgan Blackthorne <[email protected]>
>
>> This is a spin-off question related to the other monitoring system thread
>> we have going, taking it from the general direction towards a specific
>> use-case scenario.
>>
>> I've used several systems throughout the years, the last two notably
>> being Nagios and Zabbix. Nagios seems better suited for monitoring, while
>> Zabbix is clearly superior in terms of graphing. Configuring NagiosGraph
>> is... more difficult than it should be, IMO. The Zabbix agent seemed to be
>> less reliable than NRPE, however, and last I worked with Zabbix it seemed
>> to default to not alerting unless explicitly configured to do so. (It's
>> been a while since we moved away from it, so my memory is a bit foggy. Near
>> as I can recall, a configured alarm via a Zabbix agent check would not fire
>> if the agent itself was not reachable, and the system did not natively
>> support the concept of a "host down" alert in that situation, either. You
>> had to manually configure a check of the network interfaces and the agent
>> itself, which seemed very counter-intuitive, and led to many situations
>> where we hadn't properly thought through all failure scenarios to configure
>> the alarms explicitly enough. All that said, I know some of the issues we
>> had and raised with Zabbix were marked as pending the 2.x branch, which is
>> out now-- I'm not sure if they've been resolved or if the framework to
>> resolve them is now in place.)
>>
>> However, I'm specifically curious to see what people are using for
>> environments where the hosts can be spun up and down outside the control of
>> the normal provisioning channels. I know that there's been significant work
>> done lately by the Opscode folks to configure Nagios dynamically via Chef,
>> which is something I've got on my to-research list when I get beyond the
>> ops programming tasks on my plate right now. I believe the downside of that
>> would be whatever the interval is between a node being terminated and the
>> configuration being regenerated. I know that Zabbix also supports the idea
>> of dynamic node registration, which seems very applicable in this case, but
>> again, I'm not sure if it's got some kind of pruning capability in place.
>>
>> I'm also curious to know along these lines if anyone has worked with a
>> system (either native or with a connector) that will take advantage of
>> Amazon's CloudWatch metrics. I could certainly monitor things like CPU and
>> network utilization myself, but if AWS is already doing so, polling their
>> data seems like it would be easier. (Potentially cleaner? I'm undecided on
>> that, since it seems like it could introduce another dependency-- yet I've
>> never seen CloudWatch unavailable when the core EC2 services were working.
>> However, I may not have explored it in enough detail to see that kind of
>> failure, so... I remain undecided.) One of the upsides of integrating with
>> CloudWatch is that I can monitor the same metrics that autoscaling is
>> operating on, and I believe actually retrieve those thresholds as well,
>> rather than needing to configure them by hand (or by role in Chef, but that
>> would still need to be manually updated if I changed the autoscaling
>> parameters).
>>
>
> I haven't used the new AWS internal stuff.
> Just please take note that the Nagios area has undergone major changes in
> the years since 2009. Stuff like NagiosGraph is heavily dated. NRPE is
> almost completely obsolete in new setups.
> Configurations would be rule-based, with inheritance of rules etc.
> Graphs would work like this:
> Does the check return performance data?
> => OK, let's paint a "PNP4Nagios" graph for it.
> No template for defining colours?
> => Well, let's just use the default template.
> And the GUI display just autodetects whether there's a valid graph and, if
> so, displays it.
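
The decision flow quoted above can be sketched roughly as follows. This is
not PNP4Nagios code, only an illustration in Python of the logic as
described: parse the performance data after the `|` in the plugin output,
graph only if there is any, and fall back to a default template when none
matches the check command. The function names and the sample check output
are made up for the example.

```python
import re

# One perfdata item looks like: label=value[UOM];warn;crit;min;max
# Labels containing spaces are wrapped in single quotes.
PERF_ITEM = re.compile(r"(?:'([^']+)'|([^\s=']+))=(-?[\d.]+)([^;\s]*)")


def parse_perfdata(plugin_output):
    """Return {label: (value, unit)} from the text after the first '|'."""
    _, sep, perf = plugin_output.partition("|")
    if not sep:
        return {}
    metrics = {}
    for quoted, bare, value, unit in PERF_ITEM.findall(perf):
        metrics[quoted or bare] = (float(value), unit)
    return metrics


def choose_template(check_command, available_templates):
    """Use a template named after the check command, else the default."""
    return check_command if check_command in available_templates else "default"


def graph_plan(check_command, plugin_output, available_templates):
    """Return None when there is nothing to graph, else what to draw."""
    metrics = parse_perfdata(plugin_output)
    if not metrics:
        return None
    return {
        "template": choose_template(check_command, available_templates),
        "metrics": sorted(metrics),
    }
```

A check that prints no `|` section gets `None` back, so no graph link would
be shown; anything with performance data gets a graph even if nobody wrote a
template for it.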
>
> Stuff can be easy.
>
> My old employer, Mathias Kettner GmbH, was responsible for a large share
> of those changes and also kicked off a project called OMD, which has a
> nice installer bundle so you stop wasting time on installing Nagios itself.
> That package also comes with alternate cores like "Icinga" or "Shinken",
> which some (some) consider much improved over Nagios.
>
> All this stuff can't be claimed to be aimed at cloud monitoring. Just,
> please, should you check out Nagios again, don't settle for doing stuff
> like it was done some years ago.
>
> Greets,
> Florian
>
>

_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/
