[Lsr] Re: Consensus Call on LSR WG work on "Leaderless Flooding Algorithm for Distributed Flood Reduction to allow reduced configuration, minimal blast radius, and ease of incremental deployment"

Tony Li Tue, 19 Nov 2024 14:46:14 -0800

Hi Gunter,

I would like to counteract some of the Fear, Uncertainty, and Doubt that has 
misled you:



> Some concerns I see with a flooding leader approach:
> Single Point of Failure (the flooding leader becomes a critical component in 
> the network. If the leader fails or becomes unreachable, it can disrupt the 
> flooding process until a new leader is elected or the network converges to an 
> alternative state)


The leader is not a single point of failure, provided that the operator enables 
multiple potential leaders.  Loss of a leader does not disrupt the flooding 
process.  Flooding continues on the flooding topology until a new flooding 
topology is computed.


> Increased Complexity (Implementing a flooding leader requires additional 
> mechanisms for leader election, maintenance, and failure detection)


The leader election algorithm is taken directly from DIS election.  If you have 
a LAN in your network, you are already running this algorithm.  Leader failure 
detection falls out of SPF already: after SPF, if the leader is disconnected, 
then the local node needs to execute leader election.


> Scalability Concerns (in large-scale or highly dynamic networks, managing a 
> single flooding leader can become a bottleneck)


Management of the leader is trivial. The leader election is O(N) in the number 
of nodes in the LSDB. This is cheaper than SPF.  If you can run SPF, then you 
can run leader election. I’ll note that all of the flooding topology 
computations that have been proposed are themselves more expensive than O(N), 
so if you can’t afford leader election, you can’t afford to run a flooding 
topology computation either.


> Convergence Delays (when the flooding leader fails, the network must initiate 
> a leader re-election process)


Leader failure does not affect convergence.  Regardless of the leader’s status, 
flooding proceeds on the pre-computed flooding topology.  Only after the 
topology change and subsequent SPF is the leader failure noticed and a new 
leader election run.  That will subsequently generate a new flooding topology.  
There is always a valid flooding topology.


> Lack of Redundancy (relying on a single leader reduces redundancy in the 
> flooding process)


An area should never, ever have a single leader candidate, unless it is the 
only node in the area.  An operator may configure every node to be a leader 
candidate, if that is appropriate for the network.  The redundancy in the 
flooding process is 
a result of redundancy in the flooding topology and has nothing whatsoever to 
do with the leader or leader election.  The specific flooding topology is the 
result of the flooding algorithm selection.


> Overhead of Leader Maintenance (continuous monitoring is required to ensure 
> the flooding leader is operational)


The cost of leader maintenance is zero.  SPF is already a sunk cost. As a side 
effect of normal SPF, the protocol infrastructure must already detect 
disconnected nodes and withdraw affected routes.  It is at this point 
in the code that a new leader election should be triggered, and not before.

If the leader is still reachable, a good implementation does not need to 
execute a single solitary additional instruction.
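
To make the "zero additional cost" point concrete, here is a minimal sketch of a post-SPF hook. It is illustrative only and not taken from any draft or implementation: the function name `leader_after_spf`, the `reachable` set, and the `run_election` callback are all hypothetical. The assumption is that SPF has already produced the set of reachable node IDs, so the only work in the common case is a membership check on data SPF computed anyway.

```python
from typing import Callable, Optional, Set


def leader_after_spf(
    current_leader: Optional[str],
    reachable: Set[str],
    run_election: Callable[[], Optional[str]],
) -> Optional[str]:
    """Return the leader to use after an SPF run.

    `reachable` is assumed to be a by-product of SPF (hypothetical name).
    The election callback is invoked only when the current leader is
    missing or no longer reachable.
    """
    if current_leader is not None and current_leader in reachable:
        # Common case: leader still reachable, no election is run.
        return current_leader
    # Leader lost (or never known): elect a new one.
    return run_election()
```

In this sketch the election is never run speculatively; it is triggered only at the point where SPF has already shown the leader to be disconnected.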


> Potential for Suboptimal Flooding Paths (the flooding leader may not always 
> have the most efficient paths to all nodes, especially in dynamic topologies)


This has nothing to do with a leader and is solely a property of the flooding 
topology algorithm.  The leader selects the algorithm, nothing more. 

Flooding is not done along paths; it is always hop-by-hop.  If an algorithm 
chooses poorly, then yes, the flooding topology can be wildly sub-optimal.  An 
operator should spend SERIOUS amounts of effort understanding how any given 
algorithm behaves in their network. An algorithm that produces a flooding 
topology that is a giant cycle with a latency of 500ms will impact convergence. 
An algorithm that causes a single weak node to have an undue burden will also 
affect convergence. An algorithm that does not produce a bi-connected topology 
will impact network resilience.  I could go on.
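
As an illustration of the kind of check an operator might run against a candidate flooding topology, the sketch below tests two of the properties mentioned above: bi-connectivity and hop diameter. It is a brute-force sketch under stated assumptions, not any proposed algorithm: the topology is modeled as an undirected adjacency map, the names `is_biconnected` and `hop_diameter` are hypothetical, and hop count is used only as a rough stand-in for latency.

```python
from collections import deque
from typing import Dict, Optional, Set

Graph = Dict[str, Set[str]]  # undirected adjacency map (assumed symmetric)


def _reachable_from(graph: Graph, start: str, removed: Optional[str]) -> Set[str]:
    """BFS from `start`, pretending the node `removed` does not exist."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v != removed and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def is_connected(graph: Graph, removed: Optional[str] = None) -> bool:
    nodes = [n for n in graph if n != removed]
    if not nodes:
        return True
    return len(_reachable_from(graph, nodes[0], removed)) == len(nodes)


def is_biconnected(graph: Graph) -> bool:
    """True if the graph stays connected after removing any single node.

    Brute force: one BFS per removed node, so O(N * (N + E)).
    """
    return is_connected(graph) and all(is_connected(graph, n) for n in graph)


def hop_diameter(graph: Graph) -> int:
    """Largest hop distance between any two nodes; assumes a connected graph."""
    worst = 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst
```

A ring of N nodes passes the bi-connectivity check but has a hop diameter of N // 2, which is the "giant cycle" case: resilient to a single failure, yet slow to flood across.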


> Complex Recovery Mechanisms (recovering from leader failures may involve 
> complex procedures that differ from standard link-state protocol operations)


The recovery mechanism from leader failure is to elect a new leader.  This is 
not complex: it is a linear scan of the nodes in the area looking for those 
nodes that are eligible and looking at their advertised priorities. It is 
exactly analogous to DIS election.  It takes less than a millisecond.
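
The linear scan described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the normative procedure: the `Node` record and its field names are hypothetical, ties are assumed to be broken by the highest system ID (in the style of DIS tie-breaking), and candidates that the last SPF found unreachable are assumed to be skipped.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Node:
    system_id: str          # hypothetical identifier, compared as a string
    eligible: bool          # node advertises itself as a leader candidate
    priority: int           # advertised leader priority
    reachable: bool = True  # assumed to come from the most recent SPF


def elect_leader(nodes: Iterable[Node]) -> Optional[Node]:
    """One O(N) pass over the nodes in the area.

    Highest priority wins; ties go to the highest system_id (assumption).
    Returns None if there is no eligible, reachable candidate.
    """
    best: Optional[Node] = None
    for node in nodes:
        if not (node.eligible and node.reachable):
            continue
        if best is None or (node.priority, node.system_id) > (
            best.priority,
            best.system_id,
        ):
            best = node
    return best
```

Because every node runs the same deterministic scan over the same LSDB contents, all nodes arrive at the same answer without exchanging any extra messages.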

 
> I believe there is place for both a flooding leader and leaderless 
> architecture. It depends upon type of network where this is implemented (for 
> example Datacenter or Service Provider WAN).


I have no issues with a leaderless architecture, if we can actually demonstrate 
one that will preserve network reliability, stability, and performance. So far, 
I haven’t seen one other than ubiquitous configuration.  I fail to see any 
distinction between DC or WAN for this.  A dense network is a dense network.

Tony

