Let me give you an example:

In Icinga 1:
Take a host that is the parent of 10 other hosts. It goes down. The 10 devices behind it should go UNREACHABLE, not DOWN. The problem we run into is the timing. In Icinga 1, soft_state_dependencies allows the soft state of the parent host to be considered for the reachability of the child hosts. If the child hosts are checked more frequently, or have a lower max_retries value, they could go into a hard DOWN state, and generate a notification, before the parent host reaches a hard DOWN state. Once the parent finally reaches a hard DOWN state, all of those child hosts go into the UNREACHABLE state, which we have configured not to send notifications for. If the parent host is down, I know everything behind it is going to be down. Let me quote from the Icinga 1 docs:

"By default, Icinga will notify contacts about both DOWN and UNREACHABLE host states. As an admin/tech, you might not want to get notifications about hosts that are UNREACHABLE. You know your network structure, and if Icinga notifies you that your router/firewall is down, you know that everything behind it is unreachable.

If you want to spare yourself from a flood of UNREACHABLE notifications during network outages, you can exclude the unreachable (u) option from the notification_options directive in your host definitions and/or the host_notification_options directive in your contact definitions."

Which is exactly what we do. But without soft_state_dependencies it all becomes dependent on timing: the parent host has to reach a hard DOWN state before the child hosts can go to an UNREACHABLE state, and there is not a lot of control over that timing. Checking the child hosts less often, or using a higher max_retries value, isn't an option for us. Generally those child hosts are access switches that don't have much redundancy, or that have a large number of singly attached devices, so we are more sensitive to their state.
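
To make that concrete, our setup looks roughly like this (the host names and template are just examples; soft_state_dependencies itself is a directive in the main icinga.cfg, not an object directive):

    # icinga.cfg: consider soft states when evaluating dependencies
    soft_state_dependencies=1

    # host template: notify on DOWN and recovery, but never UNREACHABLE
    define host {
        name                    generic-host
        notification_options    d,r     ; 'u' deliberately left out
        register                0
    }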

It's really about controlling the quantity of alerts that are generated. Take the example above: 1 parent and 10 child hosts.
Without soft_state_dependencies: if the child hosts reach a hard state before the parent, 11 alerts are generated, 10 from the children and 1 from the parent.
With soft_state_dependencies: only 1 alert is generated, and it's from the parent host when it reaches a hard state. The children have already gone to the UNREACHABLE state, because the soft state of the parent (we use max_retries = 3 in most cases) is used to evaluate dependent reachability.
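
In Icinga 1 terms that topology is nothing more than the parents directive plus the retry window; a sketch with made-up host names (max_check_attempts is the directive behind what I've been calling max_retries):

    define host {
        use                 generic-host
        host_name           core-router         ; the parent
        max_check_attempts  3
    }

    define host {
        use                 generic-host
        host_name           access-sw-01        ; one of the 10 children
        parents             core-router
        max_check_attempts  3
    }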

I've only been picking on hosts, but the same applies to services as well. We monitor both the state of interfaces and BGP sessions. The BGP session is dependent on an interface. Without using the soft state of the interface, I would get an alert for both the BGP session and the interface. If the interface is down, I know the BGP session is going to be down as well; I don't need alerts for both. Because I can't control the timing of those 2 checks, there is a high probability that the BGP session will reach a hard state before the interface does, in which case I get a notification for both. Using the soft state, as long as one check interval has completed and the interface is soft CRITICAL, the BGP session won't notify. Since we use a max_retries value greater than 1, that will always be true.
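
As a sketch, that pairing is an ordinary Icinga 1 servicedependency (the host and service names here are invented):

    define servicedependency {
        host_name                      core-router
        service_description            Interface Gi0/1
        dependent_host_name            core-router
        dependent_service_description  BGP Session AS65001
        ; suppress BGP notifications while the interface is warning/unknown/critical
        notification_failure_criteria  w,u,c
    }

With soft_state_dependencies=1 the notification_failure_criteria is evaluated against the interface's soft state too, which is what suppresses the duplicate BGP notification.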

On 11/1/2014 9:36 AM, Michael Friedrich wrote:
On 17.10.2014 at 23:43, Barry Quiel wrote:
I can't find any reference in the Icinga 2 docs to the Icinga 1
feature soft_state_dependencies. I didn't find any reference on the
monitoring-portal boards or in the icinga users mail archive.

Was this option carried forward from Icinga 1?
Is it an implicit option now?
Was it renamed?

Everything which is not explicitly mentioned in the migration docs is
not part of Icinga 2's architecture. There are certainly some features
inherited from Nagios as a fork which have been cut off, or just not
implemented in the new design. There was a long list of features which
was evaluated step by step for its importance; some of them
have been re-implemented with new algorithms, some found a new "home"
after a few revisions during the tech preview cycle, and some simply
don't exist (like the problem id someone asked about lately).


Without that option the dependencies are less effective. There
is no way to line up the timing of the checks so that the child
hosts/services check after the parent goes into a hard state. This is
a crucial setting to help reduce the number of alerts around
correlated events.

I don't see how this would make sense with Icinga 2, as check results are
generally not cached, or written as check result files on disk and reaped at an
interval (10 seconds by default in the worst case). Instead, the
reachability is immediately evaluated based on the available object states.

If you could give us a real-world example using Icinga 1.x, and show how you
have ported this into Icinga 2, we could discuss, and learn, what
problems you encounter with Icinga 2. For now, your explanations are a
bit too vague in my opinion.

Kind regards,
Michael

_______________________________________________
icinga-users mailing list
[email protected]
https://lists.icinga.org/mailman/listinfo/icinga-users

