Huaimo – Some responses inline.

From: Lsr <lsr-boun...@ietf.org> On Behalf Of Huaimo Chen
Sent: Monday, March 04, 2019 8:16 PM
To: Tony Li <tony1ath...@gmail.com>
Cc: lsr@ietf.org; Christian Hopps <cho...@chopps.org>; Acee Lindem (acee) <a...@cisco.com>
Subject: Re: [Lsr] Moving Forward [Re: Flooding Reduction Draft Redux]

Hi Tony,

>From: Tony Li [mailto:tony1ath...@gmail.com]
>Sent: Thursday, February 21, 2019 12:32 AM
>To: Huaimo Chen <huaimo.c...@huawei.com>
>Cc: Peter Psenak <ppse...@cisco.com>; Acee Lindem (acee) <a...@cisco.com>; Christian Hopps <cho...@chopps.org>; lsr@ietf.org
>Subject: Re: [Lsr] Moving Forward [Re: Flooding Reduction Draft Redux]
>
>Hi Huaimo,
>
>>The way in which the flooding topology converges in the centralized mode/solution is different from that in the distributed mode/solution. In the former, after receiving the link states for the failures, the leader computes a new flooding topology and floods it to every other node, which receives and installs the new flooding topology. The working load on every non leader node is light. It has more processing power for a procedure/method for fault tolerance to failures. However, in the latter, every node computes and installs a new flooding topology after receiving the link states for the failures. It has less processing power for a procedure/method for fault tolerance. It is better to let each of the two modes use its own procedure/method for fault tolerance to failures, which is more appropriate to it.
>
>It’s true that a distributed solution will call more on an average node than a centralized solution will. However, that is not the steady state for either. In the steady state, the flooding topology has been computed and has been put in place already. Thus, the impact of the topology computation at the time of the topology change is nil.
>
>In addition, the amount of work to temporarily amend the flooding topology should also be minimal, and by that, I mean O(log n). The decision should only be whether or not to temporarily add a link to flooding, and the only information that a node needs to do that is to determine if the node is already on the flooding topology. That should be a lookup in a tree that represents the nodes on the topology, and that lookup should be O(log n). In other words, it’s fast and efficient and not a significant drain on resources.
>

When multiple failures happen, the current flooding topology changes, the procedure for fault tolerance to failures is triggered to run, and a new flooding topology is to be computed. We need to have a converged flooding topology as soon as possible. In the distributed solution/mode, if a procedure for fault tolerance that is not appropriate to it is used, then it will take longer to reach a converged flooding topology. For example, after multiple failures occur, one procedure for fault tolerance (roughly) includes:
1) determine whether the current flooding topology splits,
2) compute backup paths to connect the split flooding topology,
3) enable/request the temporary flooding on the backup paths through extensions to Hello protocol.
We can see that this procedure for fault tolerance takes longer than the algorithm takes to compute a new flooding topology. This procedure will delay the convergence of the flooding topology, so it is not appropriate to the distributed solution/mode. So it is better for the distributed solution/mode to use a procedure for fault tolerance that is more appropriate to it.

[Les:] Given that you do not define what you think we should do, I cannot comment on whatever alternative you might have in mind. I can say that your discussion does not acknowledge that BEFORE I can compute a new flooding topology I have to make sure I know what the updated full network topology is.
This is what is compromised when the old flooding topology becomes partitioned. So the first priority has to be acquiring the updated topology.

It would be useful if you replied to the thread that Tony started earlier today where he asks for input on how best to use temporary additions to the flooding topology. One extreme (my words – not Tony’s) would be to enable flooding on all links. This clearly risks introducing a destabilizing flooding storm. The other extreme would be to enable temporary flooding on a “minimal set of links”. This clearly risks delaying convergence. If this topic interests you, please reply to Tony’s new thread (“Open issues with Dynamic Flooding”).

>>In the centralized solution/mode, scheduling an algorithm to compute flooding topology happens only on the leader, and then on the backup leader after the leader fails. The parameters for scheduling on the leader may be different from those for scheduling on the backup leader. However, in the distributed solution/mode, scheduling an algorithm to compute flooding topology occurs on every node. The parameters for scheduling on all the nodes need to be the same.
>
>Actually, that’s not true. An implementation is free to do its own internal scheduling however it chooses, regardless of whether it implements a distributed or centralized implementation.
>
>>The procedure for achieving this is specific to the distributed mode/solution.
>
>More accurately, it is specific to a given implementation.
>
>>If every particular algorithm for computing flooding topology in the distributed solution/mode describes a procedure for scheduling in details itself, there will be duplicated descriptions of the same procedure in multiple algorithms, one of which is selected to compute flooding topology on every node. It is better for the same scheduling procedure for multiple algorithms to be described in one document.
>
>Actually, since the IETF should not be specifying the details of scheduling as it is an implementation detail, as they do not affect the behavior of the protocol, it should not be discussed in any documents.

In multi-vendor networks, using different implementations will create more micro routing loops during the convergence process than using the same implementation, due to discrepancies in the scheduling parameters/timers. More micro routing loops will lead to more traffic loss. Service providers already know to use similar timers (values and behavior), but sometimes this is not possible due to limitations of implementations.

This brings us to the question of whether we need the same scheduling procedure for a flooding topology computation algorithm to be implemented by multiple vendors. If we do not have the same scheduling procedure, then service providers will have different scheduling implementations/procedures from different vendors, which will create more micro routing loops, leading to more traffic loss. If we have the same scheduling procedure, then service providers will have the same scheduling procedure from different vendors, which will create fewer micro routing loops, and thus less traffic loss. We can see that there is a need for a common scheduling procedure.

[Les:] If your concern is that we do not want one node to apply a delay of 50 ms and another node to apply a delay of 10 seconds, I think we can easily agree on that. But we have many years of experience in configuring consistent SPF delay timers, and I think that is applicable here as well. I don’t think this is a point of concern or controversy.

   Les

Best Regards,
Huaimo

>Regards,
>Tony
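
For readers who want something concrete, two operations discussed in this thread can be sketched in Python: the O(log n) check for whether a node is already on the flooding topology, and step 1 of the fault-tolerance procedure (determining whether the flooding topology has split). This is only an illustrative sketch, not code from any draft or implementation. All names (FloodingTopologyIndex, should_flood_temporarily, is_partitioned) are hypothetical, a sorted list with binary search stands in for the tree, and the rule "temporarily flood toward a neighbor that is not on the flooding topology" is an assumption made for the example.

```python
from bisect import bisect_left
from collections import deque


class FloodingTopologyIndex:
    """Sorted index of node IDs on the flooding topology (hypothetical helper)."""

    def __init__(self, node_ids):
        # A sorted list stands in for the tree; lookups are O(log n).
        self._ids = sorted(set(node_ids))

    def contains(self, node_id):
        # Binary search for node_id in the sorted list.
        i = bisect_left(self._ids, node_id)
        return i < len(self._ids) and self._ids[i] == node_id


def should_flood_temporarily(index, neighbor_id):
    """Assumed rule: add the link temporarily if the neighbor is not on the topology."""
    return not index.contains(neighbor_id)


def is_partitioned(nodes, links, failed_links):
    """Return True if the flooding topology is split once failed_links are removed."""
    failed = {frozenset(link) for link in failed_links}
    adjacency = {node: set() for node in nodes}
    for a, b in links:
        if frozenset((a, b)) not in failed:
            adjacency[a].add(b)
            adjacency[b].add(a)
    if not adjacency:
        return False
    # Breadth-first search from an arbitrary node over the surviving links.
    start = next(iter(adjacency))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    # The topology is split if some node was not reached.
    return len(seen) != len(adjacency)
```

The membership check is cheap per lookup, while the partition check walks the whole flooding topology, so it costs time linear in the number of nodes and links. Neither sketch addresses how a node learns the updated full network topology, which is the prerequisite raised in the [Les:] comments above.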

_______________________________________________
Lsr mailing list
Lsr@ietf.org
https://www.ietf.org/mailman/listinfo/lsr