Robert, THX for picking it up. See inline. -- Rafal Szarecki
Each ASBR will propagate its best route to on-site RR

That is precisely the moment when I think we need to seriously consider consequences.

point #1 - more and more stuff in BGP is opaque to BGP and plays no role in best path selection. So if someone really needs any of that information, he must not use this solution, and that should be spelled out very clearly, as this information will be lost.

[RJS] Well, the basic BGP rule is that only a single path per prefix is advertised, unless you enable ADD-PATH. Right? So the vanilla behavior of an ASBR is to select one best path and advertise it to the RR. Quite commonly, the BGP NH is changed to the IP of the ASBR’s loopback. In my solution it is changed to the IP of the ANH. I do not see how the proposed solution is going to hide information that is otherwise available.

So let’s consider the situation where I would like to export all paths from the ASBR to some kind of “Route Controller” for the purpose of analytics or EPE. If this controller is not my on-site RR, I can still do this w/o any problem via BMP or using ADD-PATH. Just do not set the BGP NH to the ANH in the export policy.

Finally, let’s assume we want ADD-PATH from the ASBR to the RR, for whatever reason. In this case the BGP NH indeed becomes important. In the proposed solution the value of the BGP NH (ANH) depends on which session a given path was learned from. So if eBGP session1 is associated with ANH1 and eBGP session2 is associated with ANH2, then ADD-PATH from the ASBR to the on-site RR will give both paths with unique BGP NH values. Then let’s assume eBGP session3 is also associated with ANH1. In this case indeed only the path learned from session1 xor session3 will be advertised to the RR (plus the path from session2). Yes, some information is suppressed now. The operator can control what may be suppressed by associating the same ANH with a given set of eBGP sessions, or not. That is configurable. The corner case is a 1:1 mapping of eBGP session to ANH, which is very similar to keeping the BGP NH unchanged (peer IP).
With the notable difference that we can remove the ANH from the IGP regardless of interface state, based on eBGP session state.

point #2 - what is the real problem we are solving? A full table takes, depending on the implementation of BGP, anywhere from 300-450 MB of RAM. An extra path would be another 150 MB. This is all control plane memory, so pretty cheap, and it happily fits any x86 box placed to act as RR.

[RJS] I agree. Memory footprint at the RR is not an issue. Convergence at scale is. Let’s assume that at site 1 I have 4 ASBRs connected to AS_2, each with 1 session, that each ASBR learns 300k prefixes from AS_2, and that all of them are best from each ASBR’s POV. So 300k paths per ASBR, 300k pfx per ASBR, 1 path per prefix per ASBR. The RR gets 4 x 300k pfx with BGP NH set to the ASBR1-2-3-4 loopbacks, and sends them w/ ADD-PATH to the on-site CR.

When the eBGP session of one of the ASBRs (say ASBR1) fails, it has to withdraw 300k paths from the RR, and the RR needs to withdraw 300k paths from the CR. Until this is done the CR will keep sending ¼ of the traffic to ASBR1, because BGP NH == loopback is still reachable.

Now if ANH is used, the CR sees 4 paths per prefix with BGP NH == ANH1-2-3-4 respectively. When the eBGP session of one of the ASBRs (say ASBR1) fails, it removes ANH1 from the IGP and starts to withdraw 300k paths from the RR, and the RR needs to withdraw 300k paths from the CR. As soon as the CR sees the IGP update (ANH1 unreachable) it can mark all 300k paths that have BGP NH == ANH1 unusable, and stop forwarding to ASBR1. If the CR runs BGP PIC EDGE it could be sub-second.

The inter-site operation, advertising only one path w/ a BGP NH representing a “set of eBGP sessions from a set of ASBRs”, is just one more optimization. Let’s call this SP_ANH (Site-Peer ANH, in contrast to the ASBR-Peer ANH discussed above).

* If the RR advertises to other sites only one path and the BGP NH is the loopback of one of the ASBRs (or an ASBR-Peer ANH), then what is the convergence in case of this ASBR’s failure? The RR has to send 300k paths w/ a new BGP NH. Until this is done, remote sites will send traffic somewhere else.
Not to the best egress site.

* If the RR advertises all 4 paths to other sites, each with the BGP NH set to the loopback of one of the ASBRs (or an ASBR-Peer ANH), then the IGP update removing this address will allow for quick restoration (as the other 3 paths are available everywhere). But in the multi-path scenario, we have just created a 2-3 level ECMP structure on the remote BGP speakers: prefix --> (list of 4 BGP NH) --> each BGP NH --> list of IGP ECMP neighbours. That is costly to manage in S/W and in HW.

* If the RR advertises to other sites only one path and the BGP NH is the Site-Peer ANH as in this proposal, then the Site-Peer ANH is not removed from the IGP (as the other ASBRs still have sessions with the Peer). Remote sites keep sending traffic using pre-failure data until the BGP update from the RR comes. And when it comes, it will have the same BGP NH as the pre-failure path. So there will be no need to update the FIB. Also the FIB structure will be simpler and less costly: prefix --> one BGP NH --> list of IGP ECMP neighbours. Some merchant chips have really limited ECMP capability…

Now you made two observations:

-A- I have a weak router which is fine from bw pov but does not have steam to handle 5M paths - My take is - ok send him 1 or 2 paths from RRs and be done.

-B- The withdraw of all BGP routes takes soo long - well let's observe that we can withdraw all routes from a given peer with a single BGP message using techniques as described in https://tools.ietf.org/html/draft-raszuk-aggr-withdraw-00

[RJS] ACK. We can always extend the protocol to do something. Or develop a new one. I think I also saw a proposal for a new NLRI to advertise “NH invalidation”. This proposal does not require any change to BGP. So it can interoperate with virtually any implementation.
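A minimal sketch of the convergence comparison made earlier in this thread (ASBR loopback vs. ASBR-Peer ANH as BGP NH on the on-site CR). It is a toy model, not any router's implementation: the names `CoreRouter` and `events_until_asbr1_unused` are hypothetical, the prefix count is scaled down from the 300k in the example, and it only counts control-plane events the CR must process before it stops using the failed ASBR, not wall-clock time.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CoreRouter:
    """Toy model (assumption) of an on-site CR receiving ADD-PATH routes."""

    # prefix -> set of BGP next hops, one per path learned from the RR
    rib: dict[str, set[str]] = field(default_factory=dict)
    # next-hop addresses currently reachable via the IGP
    igp_reachable: set[str] = field(default_factory=set)

    def learn(self, prefix: str, next_hop: str) -> None:
        self.rib.setdefault(prefix, set()).add(next_hop)
        self.igp_reachable.add(next_hop)

    def usable_next_hops(self, prefix: str) -> set[str]:
        # A path is usable only while its BGP NH resolves in the IGP.
        return self.rib.get(prefix, set()) & self.igp_reachable

    def igp_withdraw(self, next_hop: str) -> None:
        # One IGP update invalidates every path that resolves via next_hop.
        self.igp_reachable.discard(next_hop)

    def bgp_withdraw(self, prefix: str, next_hop: str) -> None:
        # One BGP withdraw removes one path of one prefix.
        self.rib.get(prefix, set()).discard(next_hop)


def events_until_asbr1_unused(use_anh: bool, num_prefixes: int = 1000) -> int:
    """Count events the CR processes before no prefix forwards to ASBR1."""
    cr = CoreRouter()
    prefixes = [f"pfx-{i}" for i in range(num_prefixes)]
    kind = "ANH" if use_anh else "loopback"
    next_hops = [f"{kind}{n}" for n in range(1, 5)]  # 4 ASBRs at the site
    for prefix in prefixes:
        for next_hop in next_hops:
            cr.learn(prefix, next_hop)

    failed = next_hops[0]  # eBGP session on ASBR1 goes down
    events = 0
    if use_anh:
        # ASBR1 removes ANH1 from the IGP: a single update reaches the CR.
        cr.igp_withdraw(failed)
        events += 1
    else:
        # The loopback stays reachable, so each path must be withdrawn in BGP.
        for prefix in prefixes:
            cr.bgp_withdraw(prefix, failed)
            events += 1

    # Sanity check: the failed next hop is no longer used by any prefix.
    assert all(failed not in cr.usable_next_hops(p) for p in prefixes)
    return events
```

In this model, `events_until_asbr1_unused(use_anh=False)` grows linearly with the number of prefixes, while `events_until_asbr1_unused(use_anh=True)` stays at one IGP update; the BGP withdraws still follow in the ANH case, but the CR no longer depends on them to stop forwarding to ASBR1.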
Bottom line: I think using Abstract Next Hop can help for specific SAFIs in specific topologies to reduce the amount of control plane, if that is ever an issue. Yes, dealing with BGP control plane handling is implementation specific, and some code does it more efficiently than others.

[RJS] Agree. The draft is focused on a specific architecture/topology (scale-out peering) where the operator wants and assumes ECMP of egress traffic among the N x ASBRs existing at a given site. Not that ANH has no other possible uses, but this draft is about this use case.

Best,
R.
_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow
