Let me give you an example:
In Icinga 1 -
A host which is the parent for 10 other hosts goes down. The 10
devices behind it should go UNREACHABLE, not DOWN. The problem we run
into is the timing. In Icinga 1, soft_state_dependencies allows the
soft state of the parent host to be considered for the reachability of
the child hosts. If the child hosts are checked more frequently, or
have a lower max_retries value, they could go into a DOWN hard state,
and generate a notification, before the parent host reaches a DOWN
hard state. Once the parent host finally reaches a hard DOWN state,
all of those child hosts go into the UNREACHABLE state, which we have
configured to not send notifications for. If the parent host is down,
I know everything behind it is going to be down. Let me quote from the
Icinga 1 docs:
"By default, Icinga will notify contacts about both DOWN and UNREACHABLE
host states. As an admin/tech, you might not want to get notifications
about hosts that are UNREACHABLE. You know your network structure, and
if Icinga notifies you that your router/firewall is down, you know that
everything behind it is unreachable.
If you want to spare yourself from a flood of UNREACHABLE notifications
during network outages, you can exclude the unreachable (u) option from
the notification_options directive in your host definitions and/or the
host_notification_options directive in your contact definitions."
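To make that concrete, here is a sketch of the relevant directives
(host and contact names are made up, and only the directives that
matter here are shown):

# Host: notify on DOWN and recovery only; 'u' is deliberately left
# out, so UNREACHABLE generates no notification.
define host{
        host_name               access-switch-01
        parents                 core-router-01
        max_check_attempts      3
        notification_options    d,r
        }

# The same idea on the contact side.
define contact{
        contact_name                  netops
        host_notification_options     d,r
        }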
That is exactly what we do. But without soft_state_dependencies it all
becomes dependent on timing: the parent host has to reach a hard DOWN
state before the child hosts can go to the UNREACHABLE state, and
there is not a lot of control over that timing. Checking the child
hosts less often, or using a higher max_retries value, isn't an option
for us. Generally those child hosts are access switches that don't
have a high level of redundancy, or that have a large quantity of
singly attached devices. We are more sensitive to the state of those
devices.
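To put numbers on the timing, a sketch with made-up intervals (values
are minutes, assuming the default interval_length of 60 seconds; only
timing-related directives shown):

# Parent: worst case to hard DOWN after an outage is up to
# check_interval + (max_check_attempts - 1) * retry_interval
# = 5 + 2 * 1 = ~7 minutes.
define host{
        host_name            core-router-01
        check_interval       5
        retry_interval       1
        max_check_attempts   3
        }

# Child: checked every minute, so worst case is ~1 + 2 = 3 minutes.
# It can easily reach a hard DOWN state, and notify, while the
# parent is still SOFT.
define host{
        host_name            access-switch-01
        parents              core-router-01
        check_interval       1
        retry_interval       1
        max_check_attempts   3
        }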
It's really about controlling the quantity of alerts that are
generated. Take the example above: 1 parent and 10 child hosts.
Without soft_state_dependencies:
If the child hosts reach a hard state before the parent, 11 alerts are
generated: 10 from the children and 1 from the parent.
With soft_state_dependencies:
Only 1 alert is generated, and it's from the parent host when it
reaches a hard state. The children have already gone to the
UNREACHABLE state, because the soft state (we use max_retries = 3 in
most cases) of the parent is used to evaluate dependent reachability.
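On the Icinga 1 side this comes down to one flag in the main config,
plus the parent/child relationships; an explicit host dependency can
additionally suppress the child notifications. A sketch, with the same
made-up names:

# icinga.cfg: consider soft states when evaluating dependencies
soft_state_dependencies=1

# Optional, in addition to the parents directive: don't notify for
# the child while the parent is DOWN or UNREACHABLE.
define hostdependency{
        host_name                       core-router-01
        dependent_host_name             access-switch-01
        notification_failure_criteria   d,u
        }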
I've only been picking on hosts, but the same applies to services as
well. We monitor both the state of interfaces and BGP sessions. The
BGP session is dependent on an interface. Without using the soft state
of the interface, I would get an alert for both the BGP session and
the interface. If the interface is down, I know that the BGP session
is going to be down as well; I don't need alerts for both. Because I
can't control the timing of those 2 checks, there is a high
probability that the BGP session will reach a hard state before the
interface does, in which case I get a notification for both. When the
soft state is used, the BGP session won't notify as long as one check
interval has completed and the interface is soft CRITICAL. Since we
use max_retries (3 in most cases), that will always be true.
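As an Icinga 1 service dependency that might look something like this
(the service descriptions are made up):

# Don't notify for the BGP session while the interface check is
# CRITICAL or UNKNOWN. With soft_state_dependencies=1 this applies
# from the first failed interface check, before the interface
# reaches a hard state.
define servicedependency{
        host_name                       core-router-01
        service_description             Interface xe-0/0/0
        dependent_host_name             core-router-01
        dependent_service_description   BGP session AS64500
        notification_failure_criteria   c,u
        }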
On 11/1/2014 9:36 AM, Michael Friedrich wrote:
> On 17.10.2014 at 23:43, Barry Quiel wrote:
>> I can't find any reference in the Icinga 2 docs to the Icinga 1
>> feature soft_state_dependencies. I didn't find any reference on the
>> monitoring-portal boards or in the icinga users mail archive.
>> Was this option carried forward from Icinga 1?
>> Is it an implicit option now?
>> Was it renamed?
> Everything which is not explicitly mentioned in the migration docs is
> not part of Icinga 2's architecture. There are certainly some features
> inherited from Nagios as a fork which have been cut off, or just not
> implemented in the new design. There was a long list of features which
> was evaluated step by step for its importance; some of them have been
> re-implemented with new algorithms, some found a new "home" after a
> few revisions during the tech preview cycle, and some simply don't
> exist, like the problem id someone asked about lately.
>> Without that option the dependencies are less effective. There is no
>> way to line up the timing of the checks so that the child
>> hosts/services check after the parent goes into a hard state. This
>> is a crucial setting to help reduce the number of alerts around
>> correlated events.
> I don't see how this would make sense with Icinga 2, as checks are
> generally not cached, or written as check files on disk and reaped at
> an interval (10 seconds by default, in the worst case). Rather, the
> reachability is immediately evaluated based on the available object
> states.
> If you could give us a real-world example using Icinga 1.x, and how
> you have ported this into Icinga 2, we could discuss, and learn, what
> problems you encounter with Icinga 2. For now, your explanations are
> a bit too vague in my opinion.
> Kind regards,
> Michael
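For completeness, my best guess at how this would be expressed in
Icinga 2 is a Dependency object. I'm assuming here that soft states
can be taken into account via the ignore_soft_states attribute
described in the Dependency docs:

// Sketch only: suppress notifications for the child while the parent
// is not Up, and consider the parent's SOFT states for reachability.
object Dependency "switch-depends-on-router" {
  parent_host_name = "core-router-01"
  child_host_name = "access-switch-01"
  disable_notifications = true
  ignore_soft_states = false
  states = [ Up ]
}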