Dear Yingzhen,
The proposed timeslot (14:30-16:00 CET 08-Nov-23) can work for me.

Regards,
Sasha

From: Yingzhen Qu <[email protected]>
Sent: Friday, November 3, 2023 10:13 PM
To: Stewart Bryant <[email protected]>
Cc: Gyan Mishra <[email protected]>; Alexander Vainshtein 
<[email protected]>; [email protected]; rtgwg-chairs 
<[email protected]>; [email protected]; 
[email protected]
Subject: Re: [EXTERNAL] draft-ietf-rtgwg-segment-routing-ti-lfa : A simple 
pathological network fragment

Hi,

I looked at the side meeting schedule, and found the following slot:
 Wednesday (11/8) 14:00 - 15:30.

I hope to have the following key contributors available for the discussion:
Stewart Bryant
Gyan Mishra
Sasha Vainshtein
Stephane Litkowski
Bruno Decraene
Ahmed Bashandy

Of course anyone is welcome to join the discussion.

Please let me know your availability. If this doesn't work out, we'll schedule 
an interim after IETF 118.

Thanks,
Yingzhen




On Fri, Nov 3, 2023 at 2:06 AM Stewart Bryant <[email protected]> wrote:



On 3 Nov 2023, at 02:43, Gyan Mishra <[email protected]> wrote:

        Gyan> TI-LFA is a critical draft for operator SR deployments and I 
agree getting it published asap is a good idea.  All vendors that have 
implemented TI-LFA have implemented uLoop.  In reality any operator 
deploying TI-LFA would always deploy uLoop avoidance at the same time per 
vendor recommendation.  The uLoop I-D is 7 years old and is mature, as every 
vendor that has implemented TI-LFA has also implemented uLoop, so I think it 
could be a slam dunk to do a quick adoption, expedite it through WGLC, and 
publish.  The other option is to combine the drafts, which may or may not be 
favorable to the WG.

The uLoop basic concept is simple: build a list of adj-SIDs from the PLR to 
the RLFA PQ node merge point, with a timer set at time T1 post convergence, 
and remove the list when the T2 timer pops.  Simple!  The solution for TI-LFA 
in my mind is not complete without uLoop.  The major issue that Stewart 
pointed out, which is what I documented in my review, relates to multiple 
entry points, a chain of P-space nodes preceding the PLR, or multiple Q-space 
nodes preceding the RLFA PQ node merge point.  Any of those longer chains of 
nodes can suffer cascaded uLoop delays from distributed convergence.

TI-LFA implementations aim to solve the problem with the least number of SIDs, 
to avoid hardware MSD issues: a single node-SID plus maybe an adj-SID, and at 
most 4 SIDs.  Using a node-SID yields ECMP along the chain of nodes not yet 
converged, resulting in many possible micro-loops.  That is the major issue 
that the hop-by-hop list of adj-SIDs along the post-convergence path solves in 
the uLoop draft.
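
To make the hop-by-hop idea concrete, here is a minimal sketch (not taken from either draft) that computes the post-convergence path around a failed link and turns it into a list of adj-SIDs, then checks the list against an MSD limit. The five-node ring, the adj-SID values, and the function names `post_convergence_path`, `adj_sid_list` and `fits_msd` are all assumptions made up for illustration; the T1/T2 timers are not modelled.

```python
import heapq

def post_convergence_path(graph, src, dst, failed_link):
    """Dijkstra on an undirected graph with failed_link removed (both directions)."""
    banned = {failed_link, failed_link[::-1]}
    dist, prev, heap = {src: 0}, {}, [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        if u == dst:
            break
        for v, cost in graph[u].items():
            if (u, v) in banned:
                continue
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None  # destination unreachable once the link is gone
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

def adj_sid_list(path, adj_sid):
    """One adj-SID per hop, so no hop is left to (possibly stale) ECMP forwarding."""
    return [adj_sid[(u, v)] for u, v in zip(path, path[1:])]

def fits_msd(sids, msd):
    """True if the SID list can be imposed by hardware with the given MSD."""
    return len(sids) <= msd

# Hypothetical ring S-A-B-C-E-S with unit costs; S is the PLR, S-E is the failed link.
GRAPH = {
    "S": {"E": 1, "A": 1},
    "A": {"S": 1, "B": 1},
    "B": {"A": 1, "C": 1},
    "C": {"B": 1, "E": 1},
    "E": {"C": 1, "S": 1},
}
# Hypothetical adj-SID values, one per directed adjacency.
ADJ_SID = {
    link: 24000 + i
    for i, link in enumerate(sorted((u, v) for u in GRAPH for v in GRAPH[u]))
}

path = post_convergence_path(GRAPH, "S", "E", ("S", "E"))
sids = adj_sid_list(path, ADJ_SID)
```

In this toy ring the list needs one adj-SID per hop of the detour, which is exactly where the MSD concern comes from: the longer the chain of not-yet-converged nodes, the deeper the stack.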

I don’t know of any other way to resolve the TI-LFA uLoop issue if it is 
implemented by itself and node-SID ECMP is utilized.  One option, though 
unlikely, is that where a chain of nodes exists, TI-LFA configured by itself 
without uLoop, while signaling the MSD maximum threshold, could build an 
adj-SID list across the nodes not yet converged from the PLR to the PQ node 
merge point.  Other than trying to fix TI-LFA so it can work independently of 
the uLoop feature, the alternative is to do what we have been discussing in 
the thread: add text on micro-loops and the interaction between the TI-LFA 
draft and the uLoop draft.

Cheers,

Gyan

As I noted earlier in the thread, unless you need to ensure that the repair 
path is congruent with the post convergence path for TE reasons, you never need 
more than two labels for a link repair.

If you use the procedures in RFC 7490 then at most you need two labels for link 
failure. One can be a normal MPLS label, the second is a label that gets the 
packet from P to Q. When we wrote RFC 7490 we did not have SR, so we were 
expecting to use T-LDP, which created additional state in the network. Now 
that SR-MPLS is deployed you can use an SR label to get from P to Q and thus avoid 
the need for T-LDP [1].
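
As a rough illustration of the two-label argument, the sketch below computes P-space and Q-space for a protected link S-E from shortest-path distances and then picks either a PQ node (one segment) or an adjacent P/Q pair (two segments). It assumes symmetric link costs, a connected graph, and at most one link between any two nodes. The six-node ring and the names `spf`, `p_and_q_space` and `repair_segments` are hypothetical, and the segments are abstract tuples rather than real label values.

```python
import heapq

def spf(graph, root):
    """Shortest-path distances from root (plain Dijkstra)."""
    dist, heap = {root: 0}, [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, cost in graph[u].items():
            if d + cost < dist.get(v, float("inf")):
                dist[v] = d + cost
                heapq.heappush(heap, (d + cost, v))
    return dist

def p_and_q_space(graph, s, e):
    """P: nodes S reaches with no shortest path over S-E.
    Q: nodes that reach E with no shortest path over S-E."""
    c = graph[s][e]
    from_s, from_e = spf(graph, s), spf(graph, e)
    p = {n for n in graph if c + from_e[n] > from_s[n]}
    q = {n for n in graph if from_s[n] + c > from_e[n]}
    return p, q

def repair_segments(graph, s, e):
    """Return a repair for link S-E using at most two segments, or None."""
    p, q = p_and_q_space(graph, s, e)
    pq = sorted(p & q)
    if pq:
        return [("node", pq[0])]  # a PQ node exists: one segment is enough
    for u in sorted(p):
        for v in sorted(graph[u]):
            if v in q and (u, v) != (s, e):
                # reach P normally, then force the single hop from P into Q
                return [("node", u), ("adj", u, v)]
    return None

# Hypothetical six-node ring with unit costs: P-space and Q-space do not overlap.
RING6 = {
    "S": {"A": 1, "E": 1},
    "A": {"S": 1, "B": 1},
    "B": {"A": 1, "C": 1},
    "C": {"B": 1, "D": 1},
    "D": {"C": 1, "E": 1},
    "E": {"D": 1, "S": 1},
}

p_space, q_space = p_and_q_space(RING6, "S", "E")
repair = repair_segments(RING6, "S", "E")
```

On this ring the result is one segment to reach B (in P-space) and one to cross the B-C adjacency into Q-space, after which C forwards to E normally. The sketch covers link failure only; node failure is the harder case discussed in [1].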

I would point out that none of this actually requires standardisation, since 
the repair is a unitary action by the PLR and uses existing widely deployed 
MPLS technology, i.e. any path that gets to P then Q will work and any path can 
be chosen that meets the needs of the operator. The notion that forcing the 
repair path to the post convergence path from the PLR  solves all the TE 
problems is questionable since, as was noted the very first time TiLFA was 
mooted, the operational traffic may no longer go via the PLR post convergence. 
It is also clear from these discussions that whilst TiLFA solves the problem of 
micro looping along the path from the PLR to Q space, that is not adequate in 
itself and thus not a useful path constraint.

Simplifying the design to use existing RFC 7490 with an SR label to get from P 
to Q would not invalidate any TiLFA implementation but would make it clear that 
implementations could choose any path that best suited their needs.

If we expect failure to be a rare event, then we could control the convergence 
with an unoptimised ordered FIB solution, an approach which is also a unitary 
action at the PLR. Of course the PLR might choose to calculate the optimum path 
cost values to speed up the process.

If we need a more expeditious approach then we can achieve this with a method 
such as nearside tunnelling which also needs at most one ordinary MPLS label.

Now let us go up a level. This is an emergency use safety system. Safety 
engineering teaches two things, firstly that such systems are rarely executed 
and thus bugs may remain hidden for a long time before they manifest 
themselves, and secondly they normally need to be applied in circumstances where 
instrumentation is difficult. The design philosophy in such systems is normally 
that they are extremely simple and thus will obviously work under all 
circumstances both those that are “expected” and those that are “reasonably 
unexpected”. This is why most safety systems are at first glance quite 
primitive.

With TiLFA I think we have lost sight of the need for simplicity and thus have 
a higher risk of a repair failure than we would have in a simpler but 
adequately functional alternative approach.

Best regards

Stewart

[1] Node failure is intrinsically more complex for all solutions and many more 
labels (or network state) may be needed. This was written up as the cartwheel 
problem in which a node has a black hole effect on the traffic and you need to 
skim the traffic around the rim of the cartwheel.

