Hi Matthew,
Thanks for the thorough explanation!
If you’re curious, I’ve been tasked with implementing a monitoring system for
hardware that is deployed on the New York City Transit bus fleet (roughly 800
buses and over 5k services so far; we will eventually grow to roughly 6k buses
and even more services) using open source solutions with as little custom
in-house code as possible. Icinga has thus far been an excellent choice for our
monitoring needs, but this question was a sticking point for me.
Once again, thanks for your help!
Phil
Philip Matuskiewicz
Systems Developer – MTA Bus Time
2 Broadway, 27th floor - D27.90
Phone (Office): 646-252-8509
philip.matuskiew...@nyct.com
________________________________
From: Matthew Brooks [mailto:matt...@sonomatechpartners.com]
Sent: Friday, May 04, 2012 6:38 AM
To: icinga-users@lists.sourceforge.net
Subject: Re: [icinga-users] Handled vs Unacknowledged Question
On Thu, May 3, 2012 at 11:53 AM, Matuskiewicz, Philip
<philip.matuskiew...@nyct.com> wrote:
Hi,
In Icinga (both Classic and Web), the overall view of the services has
3 separate numbers denoted (Unacknowledged / Acknowledged / Handled). I’ve
disabled event handlers entirely at the global and host / service level
configuration, yet half of the services / hosts that are monitored show up as
Handled, and the other half show up as Unacknowledged in this overall view.
I’ve been unable to find a pattern that determines how a service is handled vs
unacknowledged in practice, looking through the source code, and through the
thorough documentation, and googling related search terms. I’m using the
latest stable release of Icinga and Icinga-web.
Could someone please explain the difference between handled and
unacknowledged, and if it is possible to completely disable handled services
unless an event handler is explicitly defined and is enabled via the global
configuration?
Hello Philip,
That's a good question. Perhaps I should put a post up on the Icinga Blog, since
it's both an interesting (at least to me) and important subject. I certainly
have more than enough material here for it! ;^)
First, when it comes to those counts, "Handled" does not refer to Event
Handlers. Think of handled as in someone is handling the problem. The TL;DR
version is:
Unacknowledged == THE SKY IS FALLING!
Acknowledged == It broke, but it's being attended to
Handled == We planned on breaking this host/service on purpose; nothing to see
here, move along. (For services, this also covers the case where the host was
Acknowledged.)
The longer version and some of the thinking behind it (at least for the Classic
TAC Header) goes as follows:
--> "Unacknowledged" means a check you were expecting to return UP/OK instead
returned that particular problem state (Down, Critical, Warning, etc.) and it
hasn't been acknowledged. In other words, these are "real" problems that haven't
gotten any attention yet, at least as far as Icinga knows.
--> "Acknowledged" means a check you were expecting to return UP/OK instead
returned that particular problem state (Down, Critical, Warning, etc.), but
someone has acknowledged it. That means someone has told Icinga that they know
about it, and it can (hopefully) be assumed that it is being dealt with in some
fashion.
--> "Handled" means a check you were probably *NOT* expecting to return UP/OK
returned that particular problem state (Down, Critical, Warning, etc.). A host
or service counts as Handled when it is in scheduled downtime; a service also
counts as Handled when its host has been acknowledged. In a way, scheduled
downtimes are a kind of pre-acknowledgement of a deliberately caused
issue/outage, which makes them a distinctly different type of event from
something that is a surprise and needs to be Acknowledged after the fact.
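The three definitions above can be sketched as a small Python function. This is only an illustration of the rules as described in this message, not Icinga's actual code: the function name, argument names, and state strings are made up for the example, and checking the Handled conditions before the Acknowledged flag is an assumption, since the explanation does not say which takes precedence.

```python
def classify_service(state, acknowledged=False, in_downtime=False,
                     host_acknowledged=False):
    """Return the count bucket for a service, or None if it is not a problem.

    Hypothetical helper: names and precedence are assumptions for illustration.
    """
    if state == "OK":
        return None  # only problem states are counted in these buckets
    # Handled: the service is in scheduled downtime, or its host was acknowledged.
    if in_downtime or host_acknowledged:
        return "handled"
    # Acknowledged: someone told Icinga they know about this problem.
    if acknowledged:
        return "acknowledged"
    # Otherwise it is a "real" problem that nobody has attended to yet.
    return "unacknowledged"


print(classify_service("CRITICAL"))                          # unacknowledged
print(classify_service("WARNING", acknowledged=True))        # acknowledged
print(classify_service("CRITICAL", in_downtime=True))        # handled
print(classify_service("CRITICAL", host_acknowledged=True))  # handled
```

Note that none of the branches look at event handlers, which is why disabling them has no effect on the Handled count.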
The reason it's called "Handled" and not "Downtime" (as was originally planned
and, if I recall correctly, was the case for a brief time) is that, in the
case of services, when a host is acknowledged any problem services for it also
now fall under the Handled count. So Downtime wouldn't be appropriate. Not to
mention "Issues that are being handled Vs. Issues that are not already handled"
was also the wording I repeated like a mantra for a while when trying to get the
original concept across, so it probably also subconsciously stuck.
Logically separating them this way makes it easier to see what
is actually going on in your monitored environment and allows the Classic UI to
bring the "real" problems to your attention better. Prior to doing so, the
counts were more roughly clumped together making them much less useful. (BIG
Kudos to Ricardo for his work in helping to hammer out my original vision by
doing a lot of work on the backend needed for the counts) Also, in the Classic
UI prior to the introduction of the TAC header
<https://www.icinga.org/2011/05/19/the-new-classic-ui-tactical-overview-header-feedback-welcome/>
in r1.4, the counts were also only available from certain places and were not
always visible at the top, ready to inform you of any new trouble.
But that's getting off track... Needless to say, both the finer grained counts
and the improved visibility of problems were major reasons for writing the
Classic UI TAC Header in the first place.
Anyway, just as an example: previously, if you had a host in downtime, the count
for down hosts would be 1. If the host came out of downtime and at that
same moment another unrelated host went down, the count for down hosts would
still be 1, and if you weren't watching closely and on the right page, it would
be all too easy to overlook it.
Now if that same scenario were to happen, you would immediately notice the
difference, as the counts in the TAC Header would change from "0 / 0
/ 1" to "1 / 0 / 0" and the coloring would also change from a red circle with a
grey middle to a fully saturated red that grabs your attention.
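The scenario can be replayed with a small Python sketch. It only mimics the counting idea described in this message and is not how the Classic UI is implemented: the host names, dictionary keys, and the `tac_counts` helper are hypothetical, and putting downtime ahead of the acknowledged flag is an assumption.

```python
def tac_counts(hosts):
    """Return 'unacknowledged / acknowledged / handled' for DOWN hosts.

    Hypothetical helper for illustration only.
    """
    unacknowledged = acknowledged = handled = 0
    for host in hosts:
        if host["state"] != "DOWN":
            continue  # hosts that are UP are not counted as problems
        if host.get("in_downtime"):
            handled += 1          # planned outage
        elif host.get("acknowledged"):
            acknowledged += 1     # someone is attending to it
        else:
            unacknowledged += 1   # a new, unattended problem
    return "{} / {} / {}".format(unacknowledged, acknowledged, handled)


# host-a is down during its scheduled downtime; host-b is fine.
before = [
    {"name": "host-a", "state": "DOWN", "in_downtime": True},
    {"name": "host-b", "state": "UP"},
]
# host-a leaves downtime and is back up; host-b goes down unexpectedly.
after = [
    {"name": "host-a", "state": "UP"},
    {"name": "host-b", "state": "DOWN"},
]

print(tac_counts(before))  # 0 / 0 / 1
print(tac_counts(after))   # 1 / 0 / 0
```

A single undifferentiated "down hosts" total would be 1 in both snapshots, which is exactly the case that was easy to overlook before.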
So that's it in a nutshell... albeit a giant, wordy nutshell. I hope it helps
clarify it for you.
One last thing though since I mentioned making things that are important stand
out better... I highly recommend you also enable the
suppress_maintenance_downtime option in cgi.cfg. Check out the description for
it in the docs at http://docs.icinga.org/latest/en/configcgi.html
Matthew
--
///: Matthew Brooks
///: Principal Consultant & UNIX/Linux Systems Architect
///: Sonoma Technology Partners, Inc.
<http://www.sonomatechnologypartners.com/> <(updated site
///: Mobile: 707.861.0123 :: matt...@sonomatechpartners.com
<mailto:matt...@sonomatechpartners.com>
///: Icinga Core Developer [icinga.org <http://icinga.org/> ]
☀GPG Fingerprint: 33C9 E10C 7C95 4AD9 AC2E 3B66 CB03 2B80 D82B 6DED