Robert, -- Rafal Szarecki
From: Robert Raszuk <[email protected]>
Date: Friday, March 1, 2019 at 7:50 AM
To: Rafal Szarecki <[email protected]>, Natrajan Venkataraman <[email protected]>
Cc: “[email protected]” <[email protected]>
Subject: Re: [GROW] Review request for draft-szarecki-grow-abstract-nh-scaleout-peering-00

Hey Rafał,

Just do not set BGP NH to ANH in export policy.

In your Junos cfg example (slide 28) ANH is applied to EBGP peers so I am not sure how I can send something different to any IBGP peer:

[edit protocols bgp group PeerAS2]
type external;
egress-te {
    install-address 1.1.1.2;
    rib {
        inet.0;
    }
}
peer-as 2;
neighbor 11.1.1.1;
neighbor 11.1.1.5;

[RJS] The above config conditionally installs the ANH (1.1.1.2) in the RIB (inet.0 in Junos), depending on the state of the two sessions, with OR logic. On the same slide (http://www.cvent.com/events/nanog-75/custom-116-948222eca5834bc2b7a679399063e724.aspx) you have the policy “SET-AP-ANH” that sets next-hop to 1.1.1.2 and is attached as an export policy to group “to-RR”:

[protocols bgp group to-RR]
export [ ... SET-AP-ANH ]
neighbor 11.0.0.6

[policy-options policy-statement SET-AP-ANH term 10]
from {
    as-path fromAS2;
}
then {
    next-hop 1.1.1.2;
}

[RJS] I agree. Memory footprint at RR is not an issue. Convergence at scale is.

Convergence is not a problem .. in fact no one should be counting on protocol *convergence* for fast connectivity restoration at any event of failure these days.

[RJS] Let me rephrase: connectivity restoration is a problem. To do this fast we need:

1. Backup information available on all devices across the network, ideally pre-programmed in the data plane.
2. The ability to switch multiple 1000’s of prefixes from primary to backup as the result of ONE event (unreachability of the BGP NH address == BGP PIC).

To exercise this, the BGP NH must not be the ASBR loopback, as it stays in the IGP as long as the ASBR is UP. Let us assume that at site 1 I have 4 ASBRs connected to AS_2, each with 1 session, and each ASBR learns 300k prefixes from AS_2, and all of them are best from each ASBR's POV.
So 300k paths per ASBR, 300k pfx per ASBR, 1 path per prefix per ASBR. The RR gets 4 x 300k pfx with BGP NH set to the ASBR1-2-3-4 loopbacks, and sends them w/ ADD-PATH to the on-site CR.

When the eBGP session of one ASBR (say ASBR1) fails, it has to withdraw 300k paths from the RR, and the RR needs to withdraw 300k paths from the CR. Until this is done the CR will keep sending ¼ of the traffic to ASBR1, because BGP NH == loopback is still reachable.

Now if ANH is used, the CR sees 4 paths per prefix with BGP NH == ANH1-2-3-4 respectively. When the eBGP session of one ASBR (say ASBR1) fails, it removes ANH1 from the IGP and starts to withdraw 300k paths from the RR, and the RR needs to withdraw 300k paths from the CR. As soon as the CR sees the IGP update (ANH1 unreachable) it can mark all 300k paths that have BGP NH == ANH1 unusable and stop forwarding to ASBR1. If the CR runs BGP PIC EDGE this could be sub-second.

All of this works fine today out of the box if you do not set next hop self on your ASBRs (typical cfg in non MPLS networks :).

[RJS] I have the impression that NHS is quite popular in non-MPLS networks as well. I have heard it from a few customers:

* Fewer losses if an external interface flaps and a backup exists (a typical ASBR has a bunch of peers + 1-2 transit/uplink providers).
* The peer address may belong to the peer's IP space. Carrying it inside the IGP is perceived as operationally problematic - harder to audit/secure the network.

Not that I personally buy it fully, but I see the point.

And if you do set nhp there as mentioned you can control when it is redistributed into IGP (or removed from it) by tracking any object (or set of objects) you define. Very simple.

[RJS] I do not get this. Do you mean setting BGP NH to peer_IP or to loopback?

* Peer_IP – There is more than one way to skin the cat. To get this address into the IGP you can:
    * Configure the inter-AS interface under the IGP w/ the passive option. Then if an “object” is tracked, this configuration needs to be altered when the object is unavailable/in fault state.
    * Redistribute the interface subnet to the IGP via policy/route-map.
      Then if an “object” is tracked, this policy configuration needs to be altered when the object is unavailable/in fault state.
    * The issue with Peer_IP is that it is inherently 1:1 associated with a session. So what if I want to group sessions (see slide 14, where ASBR1.1 has 2 links/sessions to AS2, and slide 12, which explains that peer_AS may also have a scale-out topology w/ multiple ASBRs @ site) and remove the BGP NH address from the IGP only if the given ASBR loses all sessions in the group?
* Loopback_IP
    * Let us say we have multiple IPs on the loopback (or multiple loopbacks), and IP1 used for management is different from IP2 used to source iBGP sessions, which is different from IP3-IPn used as BGP NH. This way, we may redistribute IP3-IPn to the IGP conditionally, w/ tracking of some “object”.
    * This allows me 1:1 as well as N:1 association of eBGP sessions to BGP NH.
    * And this is just another way of implementing ANH. It is exactly the same.
    * The question now is what the tracked “object” is and how we can do this locally.
    * In the draft we track the state and EoR of a set of sessions. So if a session is ESTABLISHED but still in initial learning, the ANH is not in the IGP. This is to:
        * Prevent churn when max-prefix is enabled on the session and is exceeded; the session will be torn down.
        * Give the data plane some time to be programmed before making the given ASBR an available exit for the rest of the network.
    * The other option could be waiting for the first BGP keepalive after the session enters the ESTABLISHED state. Yet another would be to monitor some “lead” prefix that we know can be learned only over a session from the set associated with the given ANH. This is local implementation. In Junos we chose one way that we believe is good, but other implementations can do this their own way (e.g. I know IOS-XR monitors the 1st keepalive).

The inter-site operation – advertising only one path w/ a BGP NH representing a “set of eBGP sessions from a set of ASBRs” – is just one more optimization. Let us call this SP_ANH (Site-Peer ANH, in contrast to the ASBR-Peer ANH discussed above).
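
The per-ASBR tracking rule discussed above (the ANH stays in the IGP while at least one session in its set is ESTABLISHED and past initial learning) can be sketched roughly as follows. This is only an illustration of the logic, not the Junos implementation; the `Session` class, its field names and `anh_should_be_in_igp` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """One eBGP session associated with an ANH (hypothetical model)."""
    neighbor: str
    established: bool   # BGP FSM is in the ESTABLISHED state
    eor_received: bool  # End-of-RIB seen, i.e. initial learning is done


def session_ready(s: Session) -> bool:
    # A session counts only once it is ESTABLISHED *and* EoR has arrived.
    return s.established and s.eor_received


def anh_should_be_in_igp(sessions: list[Session]) -> bool:
    # OR logic across the set: one ready session is enough to keep the ANH.
    return any(session_ready(s) for s in sessions)


group = [
    Session("11.1.1.1", established=True, eor_received=False),
    Session("11.1.1.5", established=False, eor_received=False),
]
print(anh_should_be_in_igp(group))  # False: still in initial learning

group[0].eor_received = True
print(anh_should_be_in_igp(group))  # True: one session is ready

group[0].established = False
group[0].eor_received = False
print(anh_should_be_in_igp(group))  # False: all sessions in the set are down
```

A set of size 1 gives the 1:1 session-to-ANH case; a larger set gives the N:1 grouping.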
· If the RR advertises to other sites only one path and the BGP NH is the loopback of one of the ASBRs (or ASBR-Peer ANH), then what is the convergence in case of this ASBR's failure? The RR has to send 300k paths w/ a new BGP NH. Until this is done, remote sites will send traffic somewhere else, not to the best egress site.

Advertise not one but two paths each coming from different ASBRs.

[RJS] This will work as well, to a degree. It obviously requires ADD-PATH among sites.

* What if I have 8 ASBRs and by bad luck 2 of them have lost their sessions – the 2 the RR chose to re-advertise? I still have 6 ASBRs capable of sending traffic out, but the RR needs to remove the BGP NH from the IGP for both of them and re-advertise all paths with yet another BGP NH. As a result traffic will be lost or sent to a sub-optimal exit for the time needed for BGP to converge.
* The other issue is that

· If the RR advertises to other sites all 4 paths and the BGP NH is the ASBR loopback (or ASBR-Peer ANH), then the IGP update removing this address will allow for quick restoration (as the other 3 paths are available everywhere). But in the multi-path scenario we have just created a 2-3 level ECMP structure on remote BGP speakers: prefix--> (list of 4 BGP NH) --> each BGP NH --> list of IGP ECMP neighbours. That is costly to manage in S/W and in HW.

If you advertise ANH per ASBR you will still have 4 different next hops so exactly the case as above.

[RJS] No. My RR is doing:

1. Toward CR and ASBR in the same site: ADD-PATH with BGP NH unchanged (per-ASBR ANH).
2. Toward other sites: another BGP NH replacement, w/ the Site-Peer ANH (Area-Wide).

This way other sites see the BGP NH as if it were anycast for all ASBRs, but intra-site paths have an ASBR-specific BGP NH. There is a hierarchy of abstraction. When a packet arrives at a CR of the egress site, there are N paths with per-ASBR specific NH.

One may ask: what if MPLS is used and the CR is an LSR only? In this case the BGP advertisement and BGP NH manipulation are exactly the same as above.
However, the inter-site LSP FEC is the Site-Peer ANH, and it must be inserted into IGP/LDP from some node that has full routing – let us say from all ASBRs (as anycast). This is in addition to the ASBR-Peer ANH insertion into the IGP. In this case traffic may indeed leave the tunnel on an ASBR whose best route is not external (due to the asymmetry you described below and discussed in the draft). However, as ASBRs get the routes of the other ASBRs of the same site from the RR with ASBR-Peer ANH (ASBR specific), it will U-turn traffic to the correct ASBR via the intra-site infrastructure. It is suboptimal routing, but it does not cause a loop. It is a tradeoff: full IP lookup on the CR (or possibly on a “spine”/aggregation device sitting between ASBR and CR), or intra-site suboptimal routing where only ASBRs need to have a full IP FIB.

I do not pretend that this solution is the holy grail of networking and solves the world hunger problem. It is just a tool/solution that may help, and the network owner is to judge if it serves him well.

If you however proposing to advertise all paths from all ASBRs with the same next hop (in your case AS-WIDE ANH) - brilliant - but how are you assuring that all EBGP peers send you symmetric routes? If you get some EBGP sessions giving you partial BGP reachability for whatever reason - and you are still using anycast ANH from all 4 ASBRs the packets which IGP sends towards said ASBRs would be either dropped or looped between ASBRs till TTL expires as next hop is still ANH so anycast.

· If the RR advertises to other sites only one path and the BGP NH is the Site-Peer ANH as in this proposal, then the Site-Peer ANH is not removed from the IGP (as other ASBRs have sessions with the peer). Remote sites keep sending traffic using pre-failure data until the BGP update from the RR comes. And when it comes, it will have the same BGP NH as the pre-failure path, so there will be no need to update the FIB. The FIB structure will also be simpler and less costly: prefix--> one BGP NH --> list of IGP ECMP neighbours. Some merchant chips have really limited ECMP capability…

See above.
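
The two FIB hierarchies compared in this thread (prefix --> BGP NH list --> IGP reachability) can be illustrated with a toy model. This is a sketch only, not any vendor's FIB: the `Fib` class, its fields and the next-hop names (`ANH1`..`ANH4`, `SP_ANH`) are hypothetical, and the prefix count is scaled down from the 300k used in the example.

```python
class Fib:
    """Toy hierarchical FIB: prefix -> BGP next hops -> IGP reachability."""

    def __init__(self):
        self.bgp_nh = {}            # prefix -> list of BGP next hops
        self.igp_reachable = set()  # BGP next hops currently present in the IGP

    def usable(self, prefix):
        # Only next hops that still resolve in the IGP are used for forwarding.
        return [nh for nh in self.bgp_nh[prefix] if nh in self.igp_reachable]


prefixes = [f"pfx{i}" for i in range(1000)]

# In-site view: ADD-PATH, one per-ASBR ANH per path.
in_site = Fib()
in_site.igp_reachable = {"ANH1", "ANH2", "ANH3", "ANH4"}
for p in prefixes:
    in_site.bgp_nh[p] = ["ANH1", "ANH2", "ANH3", "ANH4"]

# Remote-site view: one path per prefix, next hop is the Site-Peer ANH.
remote = Fib()
remote.igp_reachable = {"SP_ANH"}
for p in prefixes:
    remote.bgp_nh[p] = ["SP_ANH"]

# ASBR1 loses its eBGP session: a single IGP event withdraws ANH1.
# No per-prefix entry is touched in either FIB.
in_site.igp_reachable.discard("ANH1")

print(in_site.usable("pfx0"))  # ['ANH2', 'ANH3', 'ANH4']
print(remote.usable("pfx0"))   # ['SP_ANH']
```

In this model the in-site FIB stops using ANH1 for every prefix after one set operation, while the remote FIB is unchanged because the Site-Peer ANH stays reachable as long as another ASBR still has a session.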
See when we worked on concept of virtual BGP paths we did a lot of analysis of this and reached the conclusion that while possible the application of given abstract or virtual next hop to consistent union of prefixes reachable over N EBGP sessions from single or multiple ASBRs must be automated as you can not assure eBGP symmetry.

So even in your simplest case of ANH per ASBR per PeerAS there is zero guarantee that you get identical paths from all peers. That means that since you are going to keep the ANH in IGP till last session goes away you are going to attract traffic to such ASBRs until BGP has withdrawn all affected non symmetrical paths.

Trust me much faster would be not to set nh on ASBR and just remove peer's address in IGP in one shot. Then BGP can take however long it takes to "converge" without affecting any data plane.

[RJS] If the above is what the operator wants, it can simply assign ANH to eBGP sessions on a 1:1 basis. We are not discussing now how to do this in the CLI of a given implementation. Some could be simple, some more complex. I just say an ANH is assigned to a set of eBGP sessions, and nothing prevents the operator from making set size == 1. Indeed, in the special case of 1:1 assignment, use of the peer's interface IP (w/ /32 mask) is the logical choice. But this is up to the AS operator. One of the big Cloud guys likes to give their own IP even in this case, as they do not want IPs from another AS's address space in the IGP.

Many thx,
R.

THX
R.

R. 😊
_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow
