[Lsr] Re: Consensus Call on LSR WG work on "Leaderless Flooding Algorithm for Distributed Flood Reduction to allow reduced configuration, minimal blast radius, and ease of incremental deployment"

Les Ginsberg (ginsberg) Tue, 19 Nov 2024 15:14:02 -0800

I agree with much of what Tony has said.
I would also point out that some of the points raised by Gunter (and responded 
to by Tony) are applicable to centralized mode (as defined in 
https://www.rfc-editor.org/rfc/rfc9667.html#section-4 ).
But RFC 9667 also supports a distributed mode.


Comparisons of leader vs leaderless enablement are most aptly done using the 
distributed mode defined by RFC 9667.
It is clear that leaderless enablement could not support a centralized mode of 
operation.

This is not meant to disparage RFC 9667 centralized mode, just to highlight 
that I think a lot of the points raised by Gunter (and probably in the minds of 
others) come from the presumed use of RFC 9667 centralized mode.
That isn’t the comparison we should be making.

A separate discussion could be had as to the benefits/drawbacks of centralized 
mode vs distributed mode – but that isn’t the same discussion as leader vs 
leaderless enablement.

   Les


From: Tony Li <[email protected]> On Behalf Of Tony Li
Sent: Tuesday, November 19, 2024 2:45 PM
To: Gunter van de Velde (Nokia) <[email protected]>
Cc: Acee Lindem <[email protected]>; lsr <[email protected]>
Subject: [Lsr] Re: Consensus Call on LSR WG work on "Leaderless Flooding 
Algorithm for Distributed Flood Reduction to allow reduced configuration, 
minimal blast radius, and ease of incremental deployment"


Hi Gunter,

I would like to counteract some of the Fear, Uncertainty, and Doubt that has 
misled you:


Some concerns I see with a flooding leader approach:

  *   Single Point of Failure (the flooding leader becomes a critical component 
in the network. If the leader fails or becomes unreachable, it can disrupt the 
flooding process until a new leader is elected or the network converges to an 
alternative state)


The leader is not a single point of failure, provided that the operator enables 
multiple potential leaders.  Loss of a leader does not disrupt the flooding 
process.  Flooding continues on the flooding topology until a new flooding 
topology is computed.



  *   Increased Complexity (Implementing a flooding leader requires additional 
mechanisms for leader election, maintenance, and failure detection)


The leader election algorithm is taken directly from DIS election.  If you have 
a LAN in your network, you are already running this algorithm.  Leader failure 
detection falls out of SPF already: after SPF, if the leader is disconnected, 
then the local node needs to execute leader election.



  *   Scalability Concerns (in large-scale or highly dynamic networks, managing 
a single flooding leader can become a bottleneck)


Management of the leader is trivial. The leader election is O(N) in the number 
of nodes in the LSDB. This is cheaper than SPF.  If you can run SPF, then you 
can run leader election. I’ll note that all of the flooding topology 
computations that have been proposed are also more expensive than O(N), so if 
you can’t afford leader election, you can’t afford to run a flooding topology 
computation either.
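
As an illustration of the linear scan described above, here is a minimal 
sketch in Python. The Node type, its field names, and the tie-break on the 
numerically highest system ID are assumptions made for this example, loosely 
modeled on DIS-style election; the normative rules are in RFC 9667.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical per-node view of the LSDB (names are illustrative only)."""
    system_id: str                   # used as the tie-breaker (assumption)
    leader_priority: Optional[int]   # None: node is not a leader candidate
    reachable: bool = True           # result of the most recent SPF


def elect_leader(lsdb: Iterable[Node]) -> Optional[Node]:
    """Single pass over the LSDB: O(N) in the number of nodes."""
    best: Optional[Node] = None
    for node in lsdb:
        # Skip nodes that are unreachable or did not advertise eligibility.
        if not node.reachable or node.leader_priority is None:
            continue
        if best is None or (node.leader_priority, node.system_id) > (
            best.leader_priority, best.system_id
        ):
            best = node
    return best
```

Every node runs the same scan over the same LSDB, so all nodes arrive at the 
same leader without any additional message exchange.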



  *   Convergence Delays (when the flooding leader fails, the network must 
initiate a leader re-election process)


Leader failure does not affect convergence.  Regardless of the leader’s status, 
flooding proceeds on the pre-computed flooding topology.  Only after the 
topology change and subsequent SPF is the leader failure noticed and a new 
leader election run.  That will subsequently generate a new flooding topology.  
There is always a valid flooding topology.



  *   Lack of Redundancy (relying on a single leader reduces redundancy in the 
flooding process)


An area should never, ever have a single leader candidate, unless it is the 
only node in the area.  An operator may configure every node to be a leader, if 
that is appropriate for the network.  The redundancy in the flooding process is 
a result of redundancy in the flooding topology and has nothing whatsoever to 
do with the leader or leader election.  The specific flooding topology is the 
result of the flooding algorithm selection.



  *   Overhead of Leader Maintenance (continuous monitoring is required to 
ensure the flooding leader is operational)


The cost of leader maintenance is zero.  SPF is already a sunk cost. As a side 
effect of normal SPF, the protocol infrastructure must already detect 
disconnected nodes and withdraw affected routes.  It is at this point 
in the code that a new leader election should be triggered, and not before.

If the leader is still reachable, a good implementation does not need to 
execute a single solitary additional instruction.
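
The "no extra work while the leader is reachable" point can be sketched as a 
post-SPF hook. This is only an illustration: leader_after_spf, the reachable 
set, and the elect callback are hypothetical names, not taken from any 
implementation or from RFC 9667.

```python
from typing import Callable, Optional, Set


def leader_after_spf(
    current_leader: Optional[str],
    reachable: Set[str],
    elect: Callable[[], Optional[str]],
) -> Optional[str]:
    """Return the leader to use after an SPF run.

    `reachable` is the set of node IDs SPF found reachable (already computed
    for route withdrawal). `elect` is only invoked when the current leader is
    missing from that set.
    """
    if current_leader is not None and current_leader in reachable:
        return current_leader   # leader still reachable: nothing more to do
    return elect()              # leader lost (or never chosen): re-elect
```

The only cost on the common path is a set membership test against data SPF 
has already produced; a real implementation could fold this into its existing 
unreachable-node handling.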



  *   Potential for Suboptimal Flooding Paths (the flooding leader may not 
always have the most efficient paths to all nodes, especially in dynamic 
topologies)


This has nothing to do with a leader and is solely a property of the flooding 
topology algorithm.  The leader selects the algorithm, nothing more.

Flooding is not done along paths; it is always hop-by-hop.  If an algorithm 
chooses poorly, then yes, the flooding topology can be wildly sub-optimal.  An 
operator should spend SERIOUS amounts of effort understanding how any given 
algorithm behaves in their network. An algorithm that produces a flooding 
topology that is a giant cycle with a latency of 500ms will impact convergence. 
An algorithm that causes a single weak node to have an undue burden will also 
affect convergence. An algorithm that does not produce a bi-connected topology 
will impact network resilience.  I could go on.
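
One of the properties listed above, bi-connectivity, is easy to check offline 
for a candidate flooding topology. The sketch below is a simple, deliberately 
unoptimized check and is not part of any proposed algorithm. It assumes the 
topology is given as a symmetric adjacency map, and it treats topologies with 
fewer than three nodes as not bi-connected (a choice made for this example).

```python
from typing import Dict, Optional, Set

Graph = Dict[str, Set[str]]  # node -> neighbors; assumed symmetric


def _connected(graph: Graph, skip: Optional[str] = None) -> bool:
    """True if the graph, with node `skip` removed, is connected."""
    nodes = [n for n in graph if n != skip]
    if not nodes:
        return True
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        u = stack.pop()
        for v in graph[u]:
            if v != skip and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(nodes)


def is_biconnected(graph: Graph) -> bool:
    """True if no single node failure partitions the flooding topology."""
    if len(graph) < 3:
        return False  # assumption: too small to be meaningfully bi-connected
    return _connected(graph) and all(_connected(graph, skip=n) for n in graph)
```

For example, a four-node ring passes the check, while a hub-and-spoke 
topology fails it because losing the hub partitions the spokes. Latency and 
per-node load, the other concerns raised above, need separate analysis.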



  *   Complex Recovery Mechanisms (recovering from leader failures may involve 
complex procedures that differ from standard link-state protocol operations)


The recovery mechanism from leader failure is to elect a new leader.  This is 
not complex: it is a linear scan of the nodes in the area, looking for those 
nodes that are eligible and comparing their advertised priorities. It is 
exactly analogous to DIS election.  It takes less than a millisecond.



I believe there is a place for both a flooding leader and a leaderless 
architecture. It depends upon the type of network where this is implemented (for 
example Datacenter or Service Provider WAN).


I have no issues with a leaderless architecture, if we can actually demonstrate 
one that will preserve network reliability, stability, and performance. So far, 
I haven’t seen one other than ubiquitous configuration.  I fail to see any 
distinction between DC or WAN for this.  A dense network is a dense network.

Tony
