Hi, Robert:

Aijun Wang
China Telecom

> On Dec 2, 2021, at 08:13, Robert Raszuk <[email protected]> wrote:
>
> Hi Aijun,
>
> If you meant that paragraph:
>
>    When only some of the ABRs can't reach the failure node/link, as that
>    described in Section 3.2, the ABR that can reach the PUAM prefix
>    should advertise one specific route to this PUAM prefix. The
>    internal routers within another area can then bypass the ABRs that
>    can't reach the PUAM prefix, to reach the PUAM prefix.
>
> If this is it then I think this is the worst possible idea. Moreover it does
> not even work as PUA/PULSE have already propagated to remote PEs and did "the
> damage". Remote PEs are sitting and waiting for the service layer to
> reconverge.

[WAJ] No, please see the paragraph before it. The internal router will act
only when it receives the PUA from all of its ABRs. If the nodes in the area
receive the PUAM flood from all of their ABRs, they will start the BGP
convergence process if a BGP session exists on this PUAM prefix. The PUAM
forces a failover that initiates immediate control plane convergence and
switchover to an alternate egress PE. Without the PUAM-forced convergence, the
down prefix will be black-holed, resulting in loss of connectivity.

>
> You are now asking for self ABR to ABR state synchronization (even if by only
> keeping an ear open to other ABR's PUAs) and based on that host route
> injection and domain wide leaking of those artificial host routes by other
> ABRs from a given area which not only does not help but will cause even more
> churn domain wide.

[WAJ] No, not domain wide. The ABRs sitting at other area borders will
advertise only the summary address that covers the detailed prefix.

>
> I can think of a few much more elegant solutions, but I will let PULSE
> authors come up with their own ideas :)
>
> - -
>
> Bottom line is that (putting aside all concerns already voiced by folks on
> this list) both PULSE & PUA ideas perhaps can be made to work for the case
> where PE really goes down.
>
> However, to make it work in situations where ABRs just think PEs are down but
> they are not, ie. by false positive DOWN events flooded domain wide - the
> entire concept becomes much harder to handle in an elegant and scalable way.
> And what I illustrated as the reasons for such scenarios is just tip of the
> iceberg in pile of reasons why ABR may think PEs went down. There are many
> more...

[WAJ] The key criterion is that the ABR can’t reach the mentioned PE. PE DOWN
is only one such case. The OL bit being set by a transit router is another,
and broken links to this PE are yet another. So PUA actually doesn’t care about
the liveness of the PE, as Tony Li commented. The ABR cares mainly about the
“unreachable” status of the PE.

>
> Kind regards,
> Robert
>
>
>> On Thu, Dec 2, 2021 at 12:58 AM Aijun Wang <[email protected]> wrote:
>> Hi, Robert:
>>
>> Aijun Wang
>> China Telecom
>>
>>>> On Dec 2, 2021, at 04:42, Robert Raszuk <[email protected]> wrote:
>>>>
>>>
>>> Apologies 2 corrections:
>>>
>>> 1) s/to their inter-as/ to their inter-area/
>>>
>>> 2) "service stops for configured PULSE timeout (as discussed 200 sec)."
>>> Actually in the described case it is much worse ... Service stops forever
>>> to such area as service layer may not be at all aware about this kind of
>>> false positive !
>>>
>>> Btw this is also not an implementation detail as all multi vendor ABRs
>>> better work in the same manner.
>>>
>>> And the robust solution to this case seems to be along the lines of the
>>> logic you have described. PULSES must be acted on by L2 ABRs or by remote
>>> PEs *only* when all sources of the summaries inject identical PULSE.
>>
>> [WAJ]
>> https://datatracker.ietf.org/doc/html/draft-wang-lsr-prefix-unreachable-annoucement-08#section-4
>> has described such situations. I have also introduced it in the IETF 112
>> meeting.
>> Please see the last paragraph of this section.
>>
>>>
>>> That makes the feature a bit more complex ....
>>>
>>> Thx,
>>> R.
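
To make the acting rule discussed above concrete (a receiver reacts only when
every ABR of its area has flooded the unreachability announcement for a
prefix, and only if it has a BGP session on that prefix), here is a minimal
Python sketch. It is an illustration under stated assumptions, not the
procedure from the draft: the class name AreaRouter, its fields, and
receive_puam() are all hypothetical, ABRs are identified by plain strings, and
timers and withdrawal of announcements are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class AreaRouter:
    """Hypothetical model of an internal router tracking PUAM floods."""
    abrs: frozenset          # IDs of all ABRs serving this area (assumed known)
    bgp_prefixes: frozenset  # prefixes on which this router has a BGP session
    puam_seen: dict = field(default_factory=dict)  # prefix -> set of ABR IDs

    def receive_puam(self, abr_id: str, prefix: str) -> bool:
        """Record one PUAM; return True if BGP convergence should start."""
        if abr_id not in self.abrs:
            return False  # ignore announcements from unknown originators
        self.puam_seen.setdefault(prefix, set()).add(abr_id)
        return self.should_converge(prefix)

    def should_converge(self, prefix: str) -> bool:
        """True only when every ABR announced the prefix as unreachable."""
        announcers = self.puam_seen.get(prefix, set())
        all_abrs_agree = bool(self.abrs) and announcers >= self.abrs
        return all_abrs_agree and prefix in self.bgp_prefixes


router = AreaRouter(
    abrs=frozenset({"ABR1", "ABR2"}),
    bgp_prefixes=frozenset({"192.0.2.1/32"}),
)
first = router.receive_puam("ABR1", "192.0.2.1/32")   # only one ABR so far
second = router.receive_puam("ABR2", "192.0.2.1/32")  # now all ABRs agree
```

With two ABRs, the first announcement alone does not trigger anything; the
router starts convergence only after the second ABR announces the same prefix.
A single ABR with a failed line card therefore cannot cause a switchover on
its own in this model.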
>>>
>>>> On Wed, Dec 1, 2021 at 9:25 PM Robert Raszuk <[email protected]> wrote:
>>>> Hi Tony,
>>>>
>>>> I have been thinking about your email a bit more. Actually the destructive
>>>> issue you have described can happen not only in the case of partitioned L1
>>>> areas.
>>>>
>>>> Deployment scenario:
>>>>
>>>> It is quite often the case that ABRs connectivity intra-area are very
>>>> different to their inter-as connections. That usually means that different
>>>> line cards are used to connect to other routers in the local area then
>>>> those in the core area.
>>>>
>>>> So when anything happens to the line card which connects L1 (for example
>>>> it goes down, there is massive congestion, protocol queue is full etc ...)
>>>> when previously received LSPs expire such ABR may trigger PULSE of all PE
>>>> routers domain wide. And all the fuses discussed to prevent massive
>>>> flooding will not kick in as there may be just say 10 PEs in the area -
>>>> all working just fine.
>>>>
>>>> The other ABRs will happily continue to inject summaries but service stops
>>>> for configured PULSE timeout (as discussed 200 sec). Note that it is full
>>>> service stop not switching to a backup path as all PEs in the area PULSED
>>>> domain wide. Not good.
>>>>
>>>> I have not seen any discussion about such a failure case so far. And only
>>>> your mail triggered it !
>>>>
>>>> Many thx,
>>>> R.
> _______________________________________________
> Lsr mailing list
> [email protected]
> https://www.ietf.org/mailman/listinfo/lsr
