Re: [Lsr] Moving Forward [Re: Flooding Reduction Draft Redux]

Les Ginsberg (ginsberg) Mon, 04 Mar 2019 21:08:22 -0800

Huaimo –

Some responses inline.

From: Lsr <[email protected]> On Behalf Of Huaimo Chen
Sent: Monday, March 04, 2019 8:16 PM
To: Tony Li <[email protected]>
Cc: [email protected]; Christian Hopps <[email protected]>; Acee Lindem (acee) 
<[email protected]>
Subject: Re: [Lsr] Moving Forward [Re: Flooding Reduction Draft Redux]

Hi Tony,

>From: Tony Li [mailto:[email protected]]
>Sent: Thursday, February 21, 2019 12:32 AM
>To: Huaimo Chen <[email protected]<mailto:[email protected]>>
>Cc: Peter Psenak <[email protected]<mailto:[email protected]>>; Acee Lindem 
>(acee) <[email protected]<mailto:[email protected]>>; Christian Hopps 
>><[email protected]<mailto:[email protected]>>; 
>[email protected]<mailto:[email protected]>
>Subject: Re: [Lsr] Moving Forward [Re: Flooding Reduction Draft Redux]
>
>
>Hi Huaimo,
>
>>The way in which the flooding topology converges in the centralized 
>>mode/solution is different
>>from that in the distributed mode/solution. In the former, after receiving 
>>the link states for the failures,
>>the leader computes a new flooding topology and floods it to every other 
>>node, which receives
>>and installs the new flooding topology. The working load on every non leader 
>>node is light. It has more
>>processing power for a procedure/method for fault tolerance to failures.
>>However, in the latter, every node computes and installs a new flooding 
>>topology after receiving
>>the link states for the failures. It has less processing power for a 
>>procedure/method for fault tolerance.
>>It is better to let each of the two modes use its own procedure/method for 
>>fault tolerance to failures,
>>which is more appropriate to it.
>
>It’s true that a distributed solution will call more on an average node than a 
>centralized
>solution will. However, that is not the steady state for either. In the
>steady state, the flooding topology has been computed and has been put in 
>place already.
>Thus, the impact of the topology computation at the time of the
>topology change is nil.
>
>In addition, the amount of work to temporarily amend the flooding topology 
>should also
>be minimal, and by that, I mean O(log n).  The decision should only
>be whether or not to temporarily add a link to flooding, and the only 
>information that a node
>needs to do that is to determine if the node is already on the
>flooding topology. That should be a lookup in a tree that represents the nodes 
>on the topology,
>and that lookup should be O(log n). In other words, it’s fast
>and efficient and not a significant drain on resources.
>
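
As a rough illustration of the lookup Tony describes above, the sketch below
keeps the node IDs that are on the flooding topology in a sorted list and
uses binary search, which gives the same O(log n) bound as a balanced tree.
The class name, the function name, and the rule "add a temporary link only
toward a node that is not already on the flooding topology" are assumptions
made for illustration; they are not taken from any draft or implementation.

```python
from bisect import bisect_left


class FloodingTopologyIndex:
    """Sorted index of the node IDs currently on the flooding topology.

    Hypothetical helper: a sorted list plus binary search stands in for
    the tree mentioned in the thread.
    """

    def __init__(self, node_ids):
        self._ids = sorted(set(node_ids))

    def contains(self, node_id):
        # Binary search: O(log n) comparisons.
        i = bisect_left(self._ids, node_id)
        return i < len(self._ids) and self._ids[i] == node_id


def should_add_temporary_link(index, neighbor_id):
    # Assumed rule: flood temporarily toward a neighbor only when that
    # neighbor is not already on the flooding topology.
    return not index.contains(neighbor_id)


index = FloodingTopologyIndex(["r1", "r3", "r2"])
print(should_add_temporary_link(index, "r9"))
print(should_add_temporary_link(index, "r1"))
```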

When multiple failures happen, the current flooding topology changes, the 
procedure for fault tolerance is triggered, and a new flooding topology has 
to be computed. We need a converged flooding topology as soon as possible.
In the distributed solution/mode, if a fault-tolerance procedure that is not 
appropriate to it is used, the flooding topology will take longer to converge.
For example, after multiple failures occur, one procedure (as a rough idea) for 
fault tolerance includes: 1) determine whether the current flooding topology 
has split, 2) compute backup paths to connect the split flooding topology, 3) 
enable/request temporary flooding on the backup paths through extensions to 
the Hello protocol. We can see that this procedure takes longer than the 
algorithm takes to compute a new flooding topology. It will therefore delay 
the convergence of the flooding topology, which makes it inappropriate for 
the distributed solution/mode.
So it is better for the distributed solution/mode to use a fault-tolerance 
procedure that is more appropriate to it.
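
To make the three steps in the example procedure concrete, here is a minimal
sketch under stated assumptions: the node has a full view of the physical
links, "backup paths" are reduced to single surviving links that join two
parts of the split flooding topology, and step 3 is only indicated by the
returned list. All function and parameter names are hypothetical, and the
greedy link selection is one possible choice, not a procedure from any draft.

```python
from collections import defaultdict, deque


def components(nodes, links):
    """Label each node with the ID of its connected component (BFS)."""
    adj = defaultdict(set)
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    label = {}
    count = 0
    for start in sorted(nodes):
        if start in label:
            continue
        label[start] = count
        queue = deque([start])
        while queue:
            n = queue.popleft()
            for m in adj[n]:
                if m not in label:
                    label[m] = count
                    queue.append(m)
        count += 1
    return label, count


def temporary_links(nodes, physical_links, flooding_links, failed_links):
    """Return surviving links that would reconnect a split flooding topology."""
    failed = {frozenset(link) for link in failed_links}
    live_ft = [l for l in flooding_links if frozenset(l) not in failed]

    # Step 1: has the flooding topology split?
    label, count = components(nodes, live_ft)
    if count == 1:
        return []

    # Step 2: greedily pick surviving links that join different parts,
    # tracking merged parts with a small union-find.
    parent = list(range(count))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    for a, b in physical_links:
        if frozenset((a, b)) in failed:
            continue
        ra, rb = find(label[a]), find(label[b])
        if ra != rb:
            parent[ra] = rb
            chosen.append((a, b))

    # Step 3 (not modeled): request temporary flooding on the chosen links.
    return chosen


nodes = ["a", "b", "c", "d"]
physical = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
flooding = [("a", "b"), ("b", "c"), ("c", "d")]
print(temporary_links(nodes, physical, flooding, failed_links=[("b", "c")]))
```

The sketch only shows the shape of the work involved; it says nothing about
how long each step takes relative to a full flooding topology computation,
which is the point under discussion.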

[Les:] Given that you do not define what you think we should do, I cannot 
comment on whatever alternative you might have in mind.

I can say that your discussion does not acknowledge that BEFORE I can compute a 
new flooding topology I have to make sure I know what the updated full network 
topology is. This is what is compromised when the old flooding topology becomes 
partitioned. So the first priority has to be acquiring the updated topology.

It would be useful if you replied to the thread that Tony started earlier today 
where he asks for input on how best to use temporary additions to the flooding 
topology.

One extreme (my words – not Tony’s) would be to enable flooding on all links. 
This clearly risks introducing a destabilizing flooding storm.

The other extreme would be to enable temporary flooding on a “minimal set of 
links”. This clearly risks delaying convergence.

If this topic interests you, please reply to Tony’s new thread (“Open issues 
with Dynamic Flooding”).

>>In the centralized solution/mode, scheduling an algorithm to compute flooding 
>>topology happens
>>only on the leader, and then on the backup leader after the leader fails. The 
>>parameters for
>>scheduling on the leader may be different from those for scheduling on the 
>>backup leader.
>>However, in the distributed solution/mode, scheduling an algorithm to compute 
>>flooding topology
>>occurs on every node. The parameters for scheduling on all the nodes need to 
>>be the same.
>
>
>Actually, that’s not true.  An implementation is free to do its own internal 
>scheduling
>however it chooses, regardless of whether it implements a
>distributed or centralized implementation.
>
>
>>The procedure for achieving this is specific to the distributed mode/solution.
>
>More accurately, it is specific to a given implementation.
>
>
>>If every particular algorithm for computing flooding topology in the 
>>distributed solution/mode
>>describes a procedure for scheduling in details itself, there will be 
>>duplicated descriptions of
>>the same procedure in multiple algorithms, one of which is selected to 
>>compute flooding
>>topology on every node. It is better for the same scheduling procedure for 
>>multiple algorithms
>>to be described in one document.
>
>
>Actually, the IETF should not be specifying the details of scheduling: it is 
>an implementation detail that does not affect the behavior of the protocol, 
>so it should not be discussed in any documents.

In multi-vendor networks, using different implementations will create more 
micro routing loops during convergence than using a single implementation, 
because of discrepancies in the scheduling parameters/timers. More micro 
routing loops lead to more traffic loss. Service providers already know to 
use similar timers (values and behavior), but sometimes this is not possible 
due to limitations of implementations.
This brings us to the question of whether we need the same scheduling 
procedure for a flooding topology computation algorithm to be implemented by 
multiple vendors. If we do not, service providers will get different 
scheduling implementations/procedures from different vendors, which will 
create more micro routing loops and thus more traffic loss. If we do, 
service providers will get the same scheduling procedure from different 
vendors, which will create fewer micro routing loops and thus less traffic 
loss.
We can see that there is a need for a common scheduling procedure.

[Les:] If your concern is that we do not want one node to apply a delay of 50 
ms and another node to apply a delay of 10 seconds I think we can easily agree 
on that. But we have many years of experience in configuring consistent SPF 
delay timers and I think that is applicable here as well. I don’t think this is 
a point of concern or controversy.
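
The kind of mismatch described above (50 ms on one node, 10 seconds on
another) is easy to detect from configuration data. The sketch below is only
an illustration: the function name, the per-node dictionary of delays, and
the "within a factor of 2 of the median" threshold are all assumptions, not
values from any specification or vendor tool.

```python
from statistics import median


def inconsistent_delays(delays_ms, max_ratio=2.0):
    """Return the nodes whose configured computation delay is far from the rest.

    delays_ms maps a node name to its configured delay in milliseconds.
    A node is flagged when its delay is more than max_ratio times the
    median, or less than the median divided by max_ratio (assumed rule).
    """
    mid = median(delays_ms.values())
    return sorted(
        node
        for node, delay in delays_ms.items()
        if delay > mid * max_ratio or delay < mid / max_ratio
    )


print(inconsistent_delays({"r1": 50, "r2": 50, "r3": 10_000}))
```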

   Les

Best Regards,
Huaimo

>Regards,
>Tony

_______________________________________________
Lsr mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/lsr
