Re: [Lsr] Multiple failures in Dynamic Flooding

Peter Psenak Mon, 11 Mar 2019 10:21:59 -0700

Hi Huaimo,

On 11/03/2019 18:08 , Huaimo Chen wrote:

Hi Tony,




    In summary, for multiple failures, the two issues below in
draft-li-lsr-dynamic-flooding are discussed:

1)      how to determine the current flooding topology is split; and

there is no need to do that. The recovery mechanism will repair the split topology if there is a way to do that.


2)      how to repair/connect the flooding topology split.


6.7.11.  Recovery from Multiple Failures


   "The nodes that remain active on the edges
   of the flooding topology partitions will recognize this and will try
   to repair the flooding topology locally by enabling temporary
   flooding towards the nodes that they consider disconnected from the
   flooding topology until a new flooding topology becomes connected
   again."


For the first issue, the discussions are still going on.

For the second issue, repairing/connecting the flooding topology split
through Hello protocol extensions does not work.  When a “backup
path”/connection of multiple hops is needed to connect/repair the
flooding topology split, a Hello cannot go beyond one hop and thus
cannot repair the flooding topology split in this case.


there is no need to send anything multi-hop.

thanks,
Peter

*From:* Tony Li [mailto:[email protected]] *On Behalf Of* [email protected]
*Sent:* Wednesday, March 6, 2019 10:45 AM
*To:* Huaimo Chen <[email protected]>
*Cc:* Christian Hopps <[email protected]>; [email protected]; [email protected]; [email protected]
*Subject:* Multiple failures in Dynamic Flooding

Hi Huaimo,

I’m sorry that you don’t find it useful. Determining the split is
trivial: when you receive an IIH, it has the system ID of the other
system in it. If that other system is not currently part of the
flooding topology, then it is quite clear that it is disconnected from
the flooding topology.

Repairing the split is done by enabling temporary flooding on the new link.

While an adjacency between two nodes is up, the Hello packets exchanged
between them will not change the node/system IDs in them.

How do you determine that the other system is not currently part of the
flooding topology?

The IIH includes the system ID.  See ISO 10589 v2, section 9.7, field
“source Id”.  The local system will have a copy of the flooding
topology and can easily see if the neighbor was present as of the last
FT computation.  If not, then it should be added (modulo rate
limiting). The local system can also examine its own LSDB.  If there is
no LSP for the neighbor, then it would seem highly likely that there is
a disconnect and the neighbor should again be added (modulo rate
limiting).

We are not requiring it, but a system could also do a more extensive
computation and compare the links between itself and the neighbor by
tracing the path in the FT and then confirming that each link is up in
the LSDB.
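
A minimal sketch of the checks described above, for illustration only.
The function names and data structures (links as frozensets of two
system IDs, the LSDB reduced to a set of system IDs) are hypothetical
and not taken from the draft or any implementation, and rate limiting
is left out. The second function is one reading of the optional "more
extensive computation": it searches for any path over flooding topology
links that the LSDB still shows as up.

```python
from collections import deque


def neighbor_looks_disconnected(neighbor_id, ft_links, lsdb_ids):
    """Cheap checks: is the IIH's source ID missing from the FT or the LSDB?

    neighbor_id: system ID taken from the received IIH.
    ft_links:    set of frozenset({a, b}) links in the last computed FT.
    lsdb_ids:    set of system IDs for which the local LSDB holds an LSP.
    """
    ft_nodes = {node for link in ft_links for node in link}
    return neighbor_id not in ft_nodes or neighbor_id not in lsdb_ids


def connected_over_ft(local_id, neighbor_id, ft_links, up_links):
    """Optional check: can we reach the neighbor over FT links that are up?"""
    adjacency = {}
    for link in ft_links:
        if link not in up_links:
            continue  # FT link is reported down in the LSDB, so skip it
        a, b = tuple(link)
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {local_id}
    queue = deque([local_id])
    while queue:
        node = queue.popleft()
        if node == neighbor_id:
            return True
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

If either function indicates a disconnect, the local system would
enable temporary flooding on the link to that neighbor.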



It normally takes a long time, more than ten minutes, to age out and
remove an LSP/LSA for the neighbor from the LSDB, even though the
neighbor is physically disconnected.

How can you decide quickly in tens of milliseconds that the flooding
topology is disconnected?

There is an issue here that we have not yet resolved, which is the rate
at which new links should be temporarily added to the flooding
topology.  Some believe that adding any new link is the correct thing
to do as it minimizes the recovery time. Others feel that enabling too
many links could cause a flooding collapse, so link addition should be
highly constrained. We are still discussing this and invite the WG’s
opinions.
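
To make the "highly constrained" option concrete, here is a minimal
sketch of a sliding-window limiter on temporary link additions. The
class name and both parameters are hypothetical; the thread leaves the
right rate open, so the numbers a caller would pass in are assumptions,
not recommendations.

```python
class TemporaryFloodingLimiter:
    """Allow at most max_links temporary additions per window of per_seconds."""

    def __init__(self, max_links, per_seconds):
        self.max_links = max_links
        self.per_seconds = per_seconds
        self._enabled_at = []  # timestamps of recent additions

    def try_enable(self, now):
        """Return True if one more link may be added at time `now` (seconds)."""
        # Forget additions that have fallen out of the window.
        self._enabled_at = [
            t for t in self._enabled_at if now - t < self.per_seconds
        ]
        if len(self._enabled_at) >= self.max_links:
            return False
        self._enabled_at.append(now)
        return True
```

Setting max_links very high approximates the "add any new link" view,
while a small value approximates the constrained view.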

The issue is resolved by the solutions in draft-cc-lsr-flooding-reduction.


One solution is below, where the given distance can be
adjusted/configured. If we want every node to flood on all its links,
we set the given distance to a big number. If we want the nodes within
2 hops of a failure to flood on all their links, we set the given
distance to 2.


   “In one way, when two or more failures on the current flooding
   topology occur almost in the same time, each of the nodes within a
   given distance (such as 3 hops) to a failure point, floods the link
   state (LS) that it receives to all the links (except for the one from
   which the LS is received) until a new flooding topology is built.”

As we have discussed, this is not a solution. In fact, this is more
dangerous than anything else that has been proposed and seems highly
likely to trigger a cascade failure. You are enabling full flooding for
many nodes.  In dense topologies, even a radius of 3 is very high.  For
example, in a LS topology, a radius of 3 is sufficient to enable full
flooding throughout the entire topology. If that were stable, we would
not need Dynamic Flooding at all.
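
The radius argument can be checked on a small example. The sketch below
assumes that "LS topology" means a two-tier leaf-spine fabric in which
every leaf connects to every spine, and that "distance to a failure
point" is measured in hops from either endpoint of the failed link;
both are assumptions, as are the fabric size and the helper names.

```python
from collections import deque


def leaf_spine(num_spines, num_leaves):
    """Build an adjacency map where every leaf connects to every spine."""
    adj = {}
    for s in range(num_spines):
        for l in range(num_leaves):
            spine, leaf = f"spine{s}", f"leaf{l}"
            adj.setdefault(spine, set()).add(leaf)
            adj.setdefault(leaf, set()).add(spine)
    return adj


def nodes_within(adj, sources, radius):
    """Return all nodes at most `radius` hops from any source (BFS)."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        if dist[node] == radius:
            continue  # do not expand beyond the radius
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return set(dist)


adj = leaf_spine(num_spines=4, num_leaves=16)
# Fail the link between spine0 and leaf0.
adj["spine0"].discard("leaf0")
adj["leaf0"].discard("spine0")
flooding = nodes_within(adj, ["spine0", "leaf0"], radius=3)
print(len(flooding), "of", len(adj), "nodes would flood on all links")
```

In this fabric every node is within 3 hops of the failure, so a
distance of 3 puts the whole topology into full flooding.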



This full flooding is enabled only for a very short time.

How do you conclude that this is more dangerous than anything else and
seems highly likely to trigger a cascade failure? Can you explain in
detail?

Another solution is just adding the minimum links temporarily to the
flooding topology to repair the split flooding topology until a new
flooding topology is built.

Agreed.  Which links constitute the minimum?  In a general topology,
with arbitrary failures that are not distributed globally, how do we
make a distributed decision about which links to enable? This is the
problem that we are trying to solve. And we have no oracle to tell us
The Right Answer.
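
For reference, this is what "minimum" would mean if such an oracle
existed. The sketch assumes a single node with a complete, consistent
view of the surviving flooding topology and of the candidate links,
which is exactly what nodes in a split topology lack, so it illustrates
the target rather than a distributed solution. The function name and
inputs are hypothetical.

```python
def minimal_repair_links(nodes, ft_links, candidate_links):
    """Pick candidate links that each join two separate FT partitions.

    nodes:           iterable of system IDs.
    ft_links:        surviving flooding topology links as (a, b) pairs.
    candidate_links: up links not on the FT, as (a, b) pairs.
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return False  # already in the same partition
        parent[ra] = rb
        return True

    for a, b in ft_links:
        union(a, b)

    # A candidate is kept only if it merges two partitions, so k partitions
    # need at most k - 1 links when the candidates can connect them.
    return [(a, b) for a, b in candidate_links if union(a, b)]
```

With three partitions {A, B}, {C, D}, {E, F} and candidates B-C, A-D,
D-E, only B-C and D-E are selected; A-D is redundant once B-C is in
place.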




We can discuss this after the first method is discussed.



Best Regards,

Huaimo

Regards,

Tony




_______________________________________________
Lsr mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/lsr

