Re: [GROW] Review request for draft-szarecki-grow-abstract-nh-scaleout-peering-00

Rafal Jan Szarecki Thu, 28 Feb 2019 16:46:20 -0800

Robert,

THX for picking it up.
See inline.
--
Rafal Szarecki

Each ASBR will propagate its best route to on-site RR

That is precisely the moment when I think we need to seriously consider
consequences.

point #1 - more and more stuff in BGP is opaque to BGP and plays no role in
best path selection. So if someone really needs any of that information, he
must not use this solution, and that should be spelled out very clearly, as this
information will be lost.

[RJS] Well, the basic BGP rule is that only a single path per prefix is
advertised, unless you enable ADD-PATH. Right? So the vanilla behavior of an
ASBR is to select one best path and advertise it to the RR. Quite commonly, the
BGP NH is changed to the IP of the ASBR’s loopback. In my solution it is changed
to the IP of the ANH. I do not see how the proposed solution is going to hide
information that is otherwise available.

So let’s consider a situation where I would like to export all paths from the
ASBR to some kind of “Route Controller” for the purpose of analytics or EPE. If
this controller is not my on-site RR, I can still do this without any problem
via BMP or using ADD-PATH. Just do not set the BGP NH to the ANH in the export
policy.

Finally, let’s assume we want ADD-PATH from the ASBR to the RR, for whatever
reason. In this case the BGP NH indeed becomes important. In the proposed
solution the value of the BGP NH (ANH) depends on which session a given path was
learned from. So if eBGP session1 has associated ANH1 and eBGP session2 has
associated ANH2, then ADD-PATH from the ASBR to the on-site RR will give both
paths with unique BGP NH values.
Then let’s assume eBGP session3 also has associated ANH1. In this case indeed
only the path learned from session1 xor session3 will be advertised to the RR
(plus the path from session2). Yes, some information is suppressed now. The
operator can control what could be suppressed by associating the same ANH with a
given set of eBGP sessions, or not. That is configurable. The corner case will
be a 1:1 mapping of eBGP session to ANH, which is very similar to keeping the
BGP NH unchanged (peer IP), with the notable difference that we can remove the
ANH from the IGP regardless of interface state, based on eBGP session state.
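
To make the session-to-ANH mapping concrete, here is a minimal Python sketch. It is a toy model, not taken from the draft: the function name, the prefix, and the rule "at most one path per (prefix, BGP NH) is advertised, first learned wins" are assumptions standing in for real best-path selection.

```python
def advertise_with_add_path(learned_paths, session_to_anh):
    """Toy model of an ASBR exporting paths to the RR with ADD-PATH.

    learned_paths: list of (prefix, ebgp_session), in preference order.
    session_to_anh: operator-configured mapping of eBGP session -> ANH.
    Assumption: only one path per (prefix, BGP NH) is advertised, and
    "first in the list" stands in for the real best-path decision.
    """
    advertised = {}
    for prefix, session in learned_paths:
        anh = session_to_anh[session]
        # Keep only the first path seen for each (prefix, ANH) pair.
        advertised.setdefault((prefix, anh), session)
    return advertised


learned = [
    ("10.0.0.0/8", "session1"),
    ("10.0.0.0/8", "session2"),
    ("10.0.0.0/8", "session3"),
]

# session1 and session3 share ANH1: one of the two paths is suppressed.
shared = {"session1": "ANH1", "session2": "ANH2", "session3": "ANH1"}
# Corner case: 1:1 mapping of session to ANH, nothing is suppressed.
one_to_one = {"session1": "ANH1", "session2": "ANH2", "session3": "ANH3"}

shared_result = advertise_with_add_path(learned, shared)
one_to_one_result = advertise_with_add_path(learned, one_to_one)

print(len(shared_result))      # 2
print(len(one_to_one_result))  # 3
```

With the shared mapping the RR sees two paths (ANH1 and ANH2); with the 1:1 mapping it sees all three, which is the corner case described above.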

point #2 - what is the real problem we are solving? A full table takes, depending
on the BGP implementation, anywhere from 300-450 MB of RAM. An extra path would
be another 150 MB. This is all control plane memory, so pretty cheap, and happily
fits any x86 box placed to act as RR.

[RJS] I agree. Memory footprint at the RR is not an issue. Convergence at scale
is.
Let’s assume that at site 1 I have 4 ASBRs connected to AS_2, each with 1
session, and each ASBR learns 300k prefixes from AS_2, all of them best from
that ASBR’s POV. So 300k paths per ASBR, 300k pfx per ASBR, 1 path per prefix
per ASBR.
The RR gets 4 x 300k pfx with the BGP NH set to the ASBR1-2-3-4 loopbacks, and
sends them with ADD-PATH to the on-site CR.
When the eBGP session of one of the ASBRs (say ASBR1) fails, it has to withdraw
300k paths from the RR, and the RR needs to withdraw 300k paths from the CR.
Until this is done the CR will keep sending ¼ of the traffic to ASBR1, because
BGP NH == loopback is still reachable.
Now if ANH is used, the CR sees 4 paths per prefix with BGP NH == ANH1-2-3-4
respectively. When the eBGP session of one of the ASBRs (say ASBR1) fails, it
removes ANH1 from the IGP and starts to withdraw 300k paths from the RR, and the
RR needs to withdraw 300k paths from the CR. As soon as the CR sees the IGP
update (ANH1 unreachable) it can mark all 300k paths that have BGP NH == ANH1
unusable and stop forwarding to ASBR1. If the CR runs BGP PIC EDGE this could be
sub-second.
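
The difference between the two cases can be illustrated with a small Python toy model. This is only a sketch under stated assumptions: the `CoreRouter` class, its method names, and the next-hop labels are hypothetical, and 1000 prefixes stand in for the 300k in the example above. It only models that a path is used for forwarding while it is present in BGP and its BGP NH is reachable in the IGP.

```python
class CoreRouter:
    """Toy CR: BGP paths per prefix plus the set of IGP-reachable next hops."""

    def __init__(self, prefixes, next_hops):
        # Every prefix starts with one path per next hop (ADD-PATH from the RR).
        self.rib = {p: set(next_hops) for p in prefixes}
        self.igp_reachable = set(next_hops)

    def igp_remove(self, nh):
        # A single IGP update: the next hop address is no longer reachable.
        self.igp_reachable.discard(nh)

    def bgp_withdraw(self, prefix, nh):
        # One per-prefix BGP withdraw relayed by the RR.
        self.rib[prefix].discard(nh)

    def forwarding_next_hops(self, prefix):
        # A path is usable only if its BGP NH resolves in the IGP.
        return self.rib[prefix] & self.igp_reachable

    def prefixes_still_using(self, nh):
        return sum(1 for p in self.rib if nh in self.forwarding_next_hops(p))


prefixes = [f"pfx{i}" for i in range(1000)]  # stand-in for 300k prefixes

# Case 1: BGP NH == ASBR loopback. The eBGP session on ASBR1 fails, but the
# loopback stays in the IGP; only 100 of the 1000 withdraws have arrived.
cr_loopback = CoreRouter(prefixes, ["lo-ASBR1", "lo-ASBR2", "lo-ASBR3", "lo-ASBR4"])
for p in prefixes[:100]:
    cr_loopback.bgp_withdraw(p, "lo-ASBR1")
stale_loopback = cr_loopback.prefixes_still_using("lo-ASBR1")

# Case 2: BGP NH == ANH. ASBR1 removes ANH1 from the IGP; no BGP withdraw
# has been processed yet.
cr_anh = CoreRouter(prefixes, ["ANH1", "ANH2", "ANH3", "ANH4"])
cr_anh.igp_remove("ANH1")
stale_anh = cr_anh.prefixes_still_using("ANH1")

print(stale_loopback, stale_anh)  # 900 0
```

In the loopback case the number of prefixes still forwarded to ASBR1 shrinks only as fast as the withdraws are processed; in the ANH case one IGP event takes it to zero.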

The inter-site operation, advertising only one path with a BGP NH representing
a “set of eBGP sessions from a set of ASBRs”, is just one more optimization.
Let’s call this SP_ANH (Site-Peer ANH, in contrast to the ASBR-Peer ANH
discussed above).

* If the RR advertises to other sites only one path and the BGP NH is the
loopback of one of the ASBRs (or an ASBR-Peer ANH), then what is the convergence
in case of this ASBR’s failure? The RR has to send 300k paths with a new BGP NH.
Until this is done, remote sites will send traffic somewhere else, not to the
best egress site.
* If the RR advertises to other sites all 4 paths and the BGP NH of each is the
loopback of its ASBR (or its ASBR-Peer ANH), then an IGP update removing the
failed address will allow for quick restoration (as the other 3 paths are
available everywhere). But in a multi-path scenario, we have just created a 2-3
level ECMP structure on remote BGP speakers:
    prefix --> (list of 4 BGP NH) --> each BGP NH --> list of IGP ECMP neighbours.
That is costly to manage in S/W and in H/W.
* If the RR advertises to other sites only one path and the BGP NH is the
Site-Peer ANH as in this proposal, then the Site-Peer ANH is not removed from
the IGP (as the other ASBRs still have sessions with the peer). Remote sites
keep sending traffic using pre-failure data until the BGP update from the RR
comes. And when it comes, it will have the same BGP NH as the pre-failure path,
so there will be no need to update the FIB. Also the FIB structure will be
simpler and less costly:
    prefix --> one BGP NH --> list of IGP ECMP neighbours.
Some merchant chips have really limited ECMP capability…
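
The three options above can be compared with a rough Python sketch. Everything here is an illustrative assumption rather than an implementation detail from the draft: the helper names, the next-hop labels, two IGP ECMP neighbours per BGP NH, and 1000 prefixes standing in for 300k.

```python
def fib_rewrites(fib, updates):
    """Apply BGP updates (prefix -> tuple of BGP NHs) to a toy FIB.

    Returns how many prefixes actually needed their FIB entry rewritten.
    """
    changed = 0
    for prefix, new_nhs in updates.items():
        if fib.get(prefix) != new_nhs:
            fib[prefix] = new_nhs
            changed += 1
    return changed


def ecmp_fanout(bgp_nhs, igp_neighbours):
    """Return (BGP-level fan-out, total IGP-level leaves) for one prefix."""
    return len(bgp_nhs), sum(len(igp_neighbours[nh]) for nh in bgp_nhs)


prefixes = [f"pfx{i}" for i in range(1000)]  # stand-in for 300k prefixes
loopbacks = ("lo-ASBR1", "lo-ASBR2", "lo-ASBR3", "lo-ASBR4")
# Assumption: each BGP NH resolves to two IGP ECMP neighbours.
igp_neighbours = {nh: ["igp-n1", "igp-n2"] for nh in loopbacks + ("SP_ANH",)}

# Option 1: one path, BGP NH == loopback of ASBR1. After ASBR1 fails the RR
# re-advertises every prefix with a new BGP NH, so every entry is rewritten.
fib_one_loopback = {p: ("lo-ASBR1",) for p in prefixes}
rewrites_loopback = fib_rewrites(
    fib_one_loopback, {p: ("lo-ASBR2",) for p in prefixes}
)

# Option 3: one path, BGP NH == Site-Peer ANH. The post-failure update carries
# the same BGP NH, so no FIB entry changes.
fib_sp_anh = {p: ("SP_ANH",) for p in prefixes}
rewrites_sp_anh = fib_rewrites(fib_sp_anh, {p: ("SP_ANH",) for p in prefixes})

# Option 2 versus option 3: shape of the per-prefix forwarding structure.
fanout_all_paths = ecmp_fanout(loopbacks, igp_neighbours)
fanout_sp_anh = ecmp_fanout(("SP_ANH",), igp_neighbours)

print(rewrites_loopback, rewrites_sp_anh)  # 1000 0
print(fanout_all_paths, fanout_sp_anh)     # (4, 8) (1, 2)
```

The first line of output corresponds to the FIB churn after an ASBR failure; the second shows the two-level fan-out of the all-paths option next to the single BGP NH of the Site-Peer ANH option.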

Now you made two observations:

-A- I have a weak router which is fine from bw pov but does not have steam to
handle 5M paths - My take is - ok send him 1 or 2 paths from RRs and be done.

-B- The withdrawal of all BGP routes takes so long - well, let's observe that we
can withdraw all routes from a given peer with single BGP message using
techniques as described in
https://tools.ietf.org/html/draft-raszuk-aggr-withdraw-00

[RJS] ACK. We can always extend the protocol to do something, or develop a new
one. I think I also saw a proposal for a new NLRI to advertise “NH
invalidation”.
This proposal does not require any change to BGP, so it can interoperate with
virtually any implementation.

Bottom line: I think using Abstract Next Hop can help for specific SAFIs in
specific topologies to reduce the amount of control plane state, if that is ever
an issue. Yes, dealing with BGP control plane handling is implementation
specific, and some code does it more efficiently than others.

[RJS] Agree. The draft is focused on a specific architecture/topology,
scale-out peering, where the operator wants and assumes ECMP of egress traffic
among the N x ASBRs existing at a given site.
Not that ANH has no other possible uses, but this draft is about this use case.

Best,
R.

_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow
