Re: [icinga-users] Handled vs Unacknowledged Question

Matuskiewicz, Philip Fri, 04 May 2012 12:02:07 -0700

Hi Matthew,


Thanks for the thorough explanation!

 

If you’re curious, I’ve been tasked with implementing a monitoring system for 
hardware that is deployed on the New York City Transit bus fleet (roughly 800 
buses and over 5k services so far; we will eventually grow to roughly 6k buses 
and even more services), using open source solutions with minimal in-house 
custom-built components.  Icinga has thus far been an excellent choice for our 
monitoring needs, but this question was a sticking point for me.  

 

Once again, thanks for your help!

 

Phil

 

Philip Matuskiewicz

Systems Developer – MTA Bus Time

2 Broadway, 27th floor - D27.90

Phone (Office): 646-252-8509

[email protected]

________________________________

From: Matthew Brooks [mailto:[email protected]] 
Sent: Friday, May 04, 2012 6:38 AM
To: [email protected]
Subject: Re: [icinga-users] Handled vs Unacknowledged Question

 

On Thu, May 3, 2012 at 11:53 AM, Matuskiewicz, Philip 
<[email protected]> wrote:

        Hi,

         

        In Icinga (both Classic and Web), the overall view of the services 
        shows 3 separate numbers (Unacknowledged / Acknowledged / Handled).  
        I’ve disabled event handlers entirely, both globally and in the host / 
        service level configuration, yet half of the monitored services / hosts 
        show up as Handled and the other half show up as Unacknowledged in this 
        overall view.  I’ve been unable to find a pattern that determines 
        whether a service is handled or unacknowledged, either in practice, in 
        the source code, in the thorough documentation, or by googling related 
        search terms.  I’m using the latest stable release of Icinga and 
        Icinga-web.

         

        Could someone please explain the difference between handled and 
        unacknowledged, and whether it is possible to completely disable 
        handled services unless an event handler is explicitly defined and 
        enabled via the global configuration?

         

 

Hello Philip, 

 

That's a good question; perhaps I should put a post up on the Icinga Blog, since 
it's both an interesting (at least to me) and important subject. I certainly 
have more than enough material here for it! ;^)

 

First, when it comes to those counts, "Handled" does not refer to Event 
Handlers. Think of handled as in "someone is handling the problem". The TL;DR 
version is:

 

Unacknowledged == THE SKY IS FALLING!

Acknowledged == It broke, but it's being attended to

Handled == We planned on breaking this host/service on purpose; nothing to see 
here, move along. (For services, it also applies when the host was Acknowledged.)

 

 

The longer version and some of the thinking behind it (at least for the Classic 
TAC Header) goes as follows:

 

--> "Unacknowledged" means a check you were expecting to return UP/OK instead 
returned that particular state (Down, Critical, Warning, etc.) and it hasn't been 
acknowledged... in other words, these are "real" problems that haven't gotten 
any attention yet, at least as far as Icinga knows.

 

--> "Acknowledged" means a check you were expecting to return UP/OK instead 
returned that particular state (Down, Critical, Warning, etc.), but someone has 
acknowledged it. In other words, someone has told Icinga that they know about it, 
and it can (hopefully) be assumed that it is being dealt with in some fashion. 

 

--> "Handled" means a check you were probably *NOT* expecting to return UP/OK 
returned that particular state (Down, Critical, Warning, etc.). A host or service 
counts as Handled when it is in scheduled downtime or, in the case of 
services, also when its host has been acknowledged. In a way, scheduled 
downtimes are a kind of pre-acknowledgement of a deliberately caused 
issue/outage, which makes them a distinctly different type of event from 
something that is a surprise and needs to be Acknowledged after the fact.
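
As a rough illustration of the three buckets described above, here is a minimal Python sketch. It is not the Classic UI's actual code: the `Problem` class, its field names, and the `bucket` function are hypothetical, and the order of the checks (Handled tested before Acknowledged) is an assumption, since the explanation above does not say which bucket wins when more than one condition is true.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    """A host or service currently in a non-UP/OK state (hypothetical model)."""
    acknowledged: bool = False       # someone acknowledged this problem
    in_downtime: bool = False        # the object is in scheduled downtime
    host_acknowledged: bool = False  # services only: the parent host was acknowledged


def bucket(problem: Problem) -> str:
    """Return which of the three counts this problem would fall under."""
    # Assumption: Handled takes precedence over Acknowledged.
    if problem.in_downtime or problem.host_acknowledged:
        return "handled"
    if problem.acknowledged:
        return "acknowledged"
    return "unacknowledged"


print(bucket(Problem()))                        # surprise problem, nobody on it
print(bucket(Problem(acknowledged=True)))       # someone is attending to it
print(bucket(Problem(in_downtime=True)))        # planned outage
print(bucket(Problem(host_acknowledged=True)))  # service on an acknowledged host
```

Note that nothing in this sketch looks at event handlers, which is why disabling them has no effect on the Handled count.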

 

The reason it's called "Handled" and not "Downtime" (as was originally 
planned and, if I recall correctly, as it was named for a brief time) is that, in 
the case of services, when a host is acknowledged any problem services on it 
also fall under the Handled count, so "Downtime" wouldn't be appropriate. Not to 
mention that "Issues that are being handled vs. issues that are not already 
handled" was the wording I repeated like a mantra for a while when trying to get 
the original concept across, so it probably also stuck subconsciously.

 

 

Logically separating them this way makes it easier to see what 
is actually going on in your monitored environment and allows the Classic UI to 
bring the "real" problems to your attention better. Before this change, the 
counts were lumped together more roughly, making them much less useful. (BIG 
kudos to Ricardo for helping to hammer out my original vision by 
doing a lot of the backend work needed for the counts.) Also, in the Classic 
UI prior to the introduction of the TAC header 
<https://www.icinga.org/2011/05/19/the-new-classic-ui-tactical-overview-header-feedback-welcome/>
in r1.4, the counts were only available from certain places and were not 
always visible at the top, ready to inform you of any new trouble. 

 

But that's getting off track... needless to say both the finer grained counts 
and the improved visibility of problems were major reasons for writing the 
Classic UI TAC Header in the first place.

 

Anyway, just as an example: previously, if you had a host in downtime, the count 
for down hosts would be 1. If that host came out of downtime and, at that 
same moment, another unrelated host went down, the count for down hosts would 
still be 1, and if you weren't watching closely and on the right page, it would 
be all too easy to overlook it.

 

Now, if that same scenario were to happen, you would immediately notice 
the difference, as the counts in the TAC Header would change from "0 / 0 
/ 1" to "1 / 0 / 0" and the coloring would change from a red circle with a 
grey middle to a fully saturated red that grabs your attention. 
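
The scenario above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `tac_counts`, the dictionary keys, and the host names are hypothetical, and the real TAC Header is not implemented this way. It compares the old single "down hosts" count with the "Unacknowledged / Acknowledged / Handled" triple before and after the switch.

```python
def tac_counts(down_hosts):
    """Format the down hosts as 'Unacknowledged / Acknowledged / Handled'.

    Each host is a dict with optional 'in_downtime' and 'acknowledged' flags
    (hypothetical field names). Downtime is checked first, which is an
    assumption about precedence.
    """
    unacknowledged = acknowledged = handled = 0
    for host in down_hosts:
        if host.get("in_downtime"):
            handled += 1
        elif host.get("acknowledged"):
            acknowledged += 1
        else:
            unacknowledged += 1
    return f"{unacknowledged} / {acknowledged} / {handled}"


# Before: one host is down, but it is in scheduled downtime.
before = [{"name": "host-a", "in_downtime": True}]
# After: host-a is back up, and an unrelated host went down unexpectedly.
after = [{"name": "host-b"}]

# The old-style single count is 1 in both cases; the triple changes.
print(len(before), "->", tac_counts(before))  # 1 -> 0 / 0 / 1
print(len(after), "->", tac_counts(after))    # 1 -> 1 / 0 / 0
```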

 

So that's it in a nutshell... albeit a giant, wordy nutshell. I hope it helps 
clarify it for you.

 

One last thing though since I mentioned making things that are important stand 
out better... I highly recommend you also enable the 
suppress_maintenance_downtime option in cgi.cfg. Check out the description for 
it in the docs at http://docs.icinga.org/latest/en/configcgi.html

 

 

Matthew

-- 

///: Matthew Brooks

///: Principal Consultant & UNIX/Linux Systems Architect

///: Sonoma Technology Partners, Inc. 
<http://www.sonomatechnologypartners.com/>  <(updated site

///: Mobile: 707.861.0123 :: [email protected] 
<mailto:[email protected]> 


///: Icinga Core Developer [icinga.org <http://icinga.org/> ]

☀GPG Fingerprint: 33C9 E10C 7C95 4AD9 AC2E 3B66 CB03 2B80 D82B 6DED


