[Lsr] Multiple failures in Dynamic Flooding

tony . li Wed, 06 Mar 2019 07:45:38 -0800

Hi Huaimo,


> > I’m sorry that you don’t find it useful. Determining the split is trivial:
> > when you receive an IIH, it has the system ID of the other system in it.
> > If that other system is not currently part of the flooding topology, then
> > it is quite clear that it is disconnected from the flooding topology.
> > Repairing the split is done by enabling temporary flooding on the new link.
>  
> When an adjacency between two nodes is up, the Hello packets exchanged
> between them do not change the node/system IDs they carry.
> How do you determine that the other system is not currently part of the
> flooding topology?


The IIH includes the system ID.  See ISO 10589 v2, section 9.7, field “source
Id”.  The local system will have a copy of the flooding topology and can
easily see if the neighbor was present as of the last FT computation.  If not,
then it should be added (modulo rate limiting). The local system can also
examine its own LSDB.  If there is no LSP for the neighbor, then it would seem
highly likely that there is a disconnect and the neighbor should again be
added (modulo rate limiting).
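
A minimal sketch of the check described above, in Python. All names here
(`RateLimiter`, `needs_temporary_flooding`, `ft_nodes`, `lsdb`) and the
sliding-window limit are illustrative assumptions; neither the draft nor any
implementation is being quoted.

```python
import time


class RateLimiter:
    """Hypothetical limiter: at most `max_adds` additions per `window` seconds."""

    def __init__(self, max_adds, window, clock=time.monotonic):
        self.max_adds = max_adds
        self.window = window
        self.clock = clock
        self.stamps = []

    def allow(self):
        now = self.clock()
        # Forget additions that have aged out of the window.
        self.stamps = [t for t in self.stamps if now - t < self.window]
        if len(self.stamps) >= self.max_adds:
            return False
        self.stamps.append(now)
        return True


def needs_temporary_flooding(source_id, ft_nodes, lsdb, limiter):
    """Decide whether to enable temporary flooding toward the IIH sender.

    source_id: system ID taken from the received IIH
    ft_nodes:  set of system IDs in the last computed flooding topology
    lsdb:      mapping of system ID -> LSP held by the local system
    """
    # Either signal suggests the neighbor is cut off from the FT.
    disconnected = source_id not in ft_nodes or source_id not in lsdb
    # The limiter is only consulted (and charged) for disconnected neighbors.
    return disconnected and limiter.allow()
```

The rate limiter is deliberately the last gate, so a neighbor that is already
on the flooding topology never consumes any of the budget.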

We are not requiring it, but a system could also do a more extensive 
computation and compare the links between itself and the neighbor
by tracing the path in the FT and then confirming that each link is up in the 
LSDB.
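
One way to sketch that more extensive check, in Python: search the flooding
topology from the local system toward the neighbor, following only FT links
that the LSDB reports as up. The data shapes (`ft_adj`, `up_links`) and the
function name are assumptions made for illustration.

```python
from collections import deque


def reachable_over_ft(ft_adj, up_links, src, dst):
    """True if dst can be reached from src over FT links that are up.

    ft_adj:   mapping of system ID -> iterable of FT neighbors
    up_links: set of frozenset({a, b}) for links the LSDB shows as up
    """
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nbr in ft_adj.get(node, ()):
            # Skip FT links that the LSDB does not report as up.
            if nbr not in seen and frozenset((node, nbr)) in up_links:
                seen.add(nbr)
                queue.append(nbr)
    return False
```

If this returns False for a neighbor with a live adjacency, the flooding
topology is split between the two systems as far as the local LSDB can tell.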


> > There is an issue here that we have not yet resolved, which is the rate 
> > that new links should be
> > temporarily added to the flooding topology.  Some believe that adding any 
> > new link is the
> > correct thing to do as it minimizes the recovery time. Others feel that 
> > enabling too many links
> > could cause a flooding collapse, so link addition should be highly 
> > constrained. We are still
> > discussing this and invite the WG’s opinions.
>  
> The issue is resolved by the solutions in draft-cc-lsr-flooding-reduction.
> One solution is below, where the given distance can be adjusted/configured.
> If we want every node to flood on all its links, we set the given
> distance to a large number. If we want the nodes within 2 hops of a failure
> to flood on all their links, we set the given distance to 2.
>    “In one way, when two or more failures on the current flooding
>    topology occur almost in the same time, each of the nodes within a
>    given distance (such as 3 hops) to a failure point, floods the link
>    state (LS) that it receives to all the links (except for the one from
>    which the LS is received) until a new flooding topology is built.”


As we have discussed, this is not a solution. In fact, this is more dangerous
than anything else that has been proposed and seems highly likely to trigger a
cascade failure. You are enabling full flooding for many nodes.  In dense
topologies, even a radius of 3 is very high.  For example, in a leaf-spine
(LS) topology, a radius of 3 is sufficient to enable full flooding throughout
the entire topology. If that were stable, we would not need Dynamic Flooding
at all.
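
The radius claim can be checked with a short Python sketch. It reads "LS
topology" as a two-tier leaf-spine fabric in which every leaf connects to
every spine, which is an assumption; the helper names and the fabric size are
also made up for illustration.

```python
from collections import deque


def leaf_spine(num_spines, num_leaves):
    """Build a two-tier fabric: every leaf is connected to every spine."""
    spines = [f"spine{i}" for i in range(num_spines)]
    leaves = [f"leaf{i}" for i in range(num_leaves)]
    adj = {s: set(leaves) for s in spines}
    adj.update({l: set(spines) for l in leaves})
    return adj


def nodes_within(adj, start, radius):
    """Return all nodes at most `radius` hops from `start` (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == radius:
            continue  # do not expand past the radius
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return set(dist)


adj = leaf_spine(num_spines=4, num_leaves=32)
# Every node in the fabric lies within 3 hops of a failure at leaf0.
print(nodes_within(adj, "leaf0", 3) == set(adj))
```

In such a fabric any two nodes are at most two hops apart, so a radius of 3
around any failure point already covers every node.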


> Another solution is just adding minimum links temporarily on the flooding
> topology to repair the split flooding topology until a new flooding topology
> is built.


Agreed.  Which links constitute the minimum?  In a general topology, with 
arbitrary failures that are not distributed globally,
how do we make a distributed decision about which links to enable? This is the 
problem that we are trying to solve. And
we have no oracle to tell us The Right Answer.


> The link can be enabled for “temporary flooding” by the node without using 
> any TLV or Hello with the TLV.


There are cases where it is far easier for the neighbor to realize that it is 
disconnected than for the local system to realize
that the neighbor is disconnected.  Thus, it is easier to allow one system to 
request temporary addition. 


> The TLV in the Hello packet just requests adding “temporary flooding” on the
> link. The other information is accessed by the node locally. The TLV in the
> Hello packet does not help in corner cases. In the case where a node is
> rebooted, a new link attached to a new node may apply.


If the node that rebooted has 1000 interfaces, which interfaces should be 
temporarily added?  Adding all of them is likely to trigger a cascade failure.  
The TLV allows us to signal which ones should be enabled.


> > All adjacencies are a single hop in both IS-IS and OSPF.  Yes, Hello
> > packets may be lost. Fortunately, they are periodically transmitted, thus
> > the next transmission will also contain the TLV.  If IIH’s are getting
> > lost at a significant rate, then the adjacency will not (and should not)
> > come up.  Thus, the request for temporary flooding will propagate to the
> > neighbor in all cases that matter.
>  
> It takes too long when a Hello packet is lost. Repairing a split flooding
> topology needs to be fast.


Fortunately, lost hello packets are a relatively rare occurrence.  While
repairing the flooding topology needs to be done expediently, attempting to do
so and triggering a cascade failure of the network is counter-productive. Given
this alternative, a bit of extra delay when adding a new system to the network
or trying to recover from multiple failures seems wise. Rushing and making
things worse does not.  The first priority must remain network stability.


> 
> It does not mean that a user/operator configures/select an area leader. It 
> means that a user/operator configures other things such as indicating an 
> algorithm or selecting the centralized mode on the area leader. 


In an implementation, centralized mode and algorithm selection can be the 
defaults.  In fact, in our implementation, the only required configuration is 
to enable dynamic flooding. Everything else is automatic.



Regards,
Tony

