Re: [GROW] Review request for draft-szarecki-grow-abstract-nh-scaleout-peering-00

Rafal Jan Szarecki Mon, 04 Mar 2019 09:47:25 -0800

Robert,

--
Rafal Szarecki

From: Robert Raszuk <[email protected]>
Date: Friday, March 1, 2019 at 7:50 AM
To: Rafal Szarecki <[email protected]>, Natrajan Venkataraman 
<[email protected]>
Cc: “[email protected]” <[email protected]>
Subject: Re: [GROW] Review request for 
draft-szarecki-grow-abstract-nh-scaleout-peering-00

Hey Rafał,

Just do not set BGP NH to ANH in export policy.

In your Junos cfg example (slide 28) ANH is applied to EBGP peers so I am not 
sure how I can send something different to any IBGP peer:

[edit protocols bgp group PeerAS2]
type external;
egress-te {
    install-address 1.1.1.2;
    rib {
        inet.0;
    }
}
peer-as 2;
neighbor 11.1.1.1;
neighbor 11.1.1.5;

[RJS] The above config conditionally installs the ANH (1.1.1.2) in the RIB 
(inet.0 in Junos), depending on the state of the two sessions, with OR logic.
On the same slide 
(http://www.cvent.com/events/nanog-75/custom-116-948222eca5834bc2b7a679399063e724.aspx)
you have the policy “SET-AP-ANH”, which sets the next hop to 1.1.1.2 and is 
attached as an export policy to the group “to-RR”:

[protocols bgp group to-RR]
    export [ ... SET-AP-ANH ]
    neighbor 11.0.0.6

[policy-options policy-statement SET-AP-ANH term 10]
from {
    as-path fromAS2;
}
then {
    next-hop 1.1.1.2;
}
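
As a rough illustration of what the export policy does to a route sent to the
RR, here is a minimal Python sketch. It assumes that "fromAS2" matches any path
whose first (neighbouring) AS is 2; the slide does not show that as-path
definition. The Route type, the function names and the example prefixes are
hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Tuple

ANH = "1.1.1.2"  # install-address of group PeerAS2


@dataclass(frozen=True)
class Route:
    prefix: str
    as_path: Tuple[int, ...]
    next_hop: str


def matches_from_as2(route: Route) -> bool:
    # Assumption: "fromAS2" means the path was received from AS 2.
    return len(route.as_path) > 0 and route.as_path[0] == 2


def set_ap_anh(route: Route) -> Route:
    """Export policy toward the RR: rewrite the next hop only for AS 2 paths."""
    if matches_from_as2(route):
        return replace(route, next_hop=ANH)
    return route


if __name__ == "__main__":
    learned = Route("203.0.113.0/24", (2, 64500), "11.1.1.1")
    print(set_ap_anh(learned).next_hop)
```

Routes that do not match the as-path condition leave the policy with their next
hop untouched, which is why the rewrite can be limited to one export group.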

[RJS] I agree. Memory footprint at RR is not an issue. Convergence at scale is.

Convergence is not a problem .. in fact no one should be counting on protocol 
*convergence* for fast connectivity restoration at any event of failure these 
days.
[RJS] Let me rephrase: connectivity restoration is a problem. To do this fast 
we need:

  1.  Backup information available on all devices across the network, ideally 
pre-programmed in the data plane.
  2.  The ability to switch multiple thousands of prefixes from primary to 
backup as the result of ONE event (unreachability of the BGP NH address == BGP 
PIC). To exercise this, the BGP NH cannot be the ASBR loopback, as that stays 
in the IGP as long as the ASBR is up.

Let's assume that at site 1 I have 4 ASBRs connected to AS_2, each with 1 
session, and each ASBR learns 300k prefixes from AS_2, all of them best from 
that ASBR's POV. So 300k paths per ASBR, 300k prefixes per ASBR, 1 path per 
prefix per ASBR.
The RR gets 4 x 300k prefixes with the BGP NH set to the ASBR1-2-3-4 loopbacks, 
and sends them with ADD-PATH to the on-site CR.
When the eBGP session of one ASBR (say ASBR1) fails, it has to withdraw 300k 
paths from the RR, and the RR needs to withdraw 300k paths from the CR. Until 
this is done, the CR keeps sending ¼ of the traffic to ASBR1, because BGP NH == 
loopback is still reachable.
Now if ANH is used, the CR sees 4 paths per prefix with BGP NH == ANH1-2-3-4 
respectively. When the eBGP session of one ASBR (say ASBR1) fails, it removes 
ANH1 from the IGP and starts to withdraw 300k paths from the RR, and the RR 
needs to withdraw 300k paths from the CR. As soon as the CR sees the IGP update 
(ANH1 unreachable), it can mark all 300k paths that have BGP NH == ANH1 
unusable and stop forwarding to ASBR1. If the CR runs BGP PIC Edge, this could 
be sub-second.
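
The difference between the two cases can be sketched as a toy model of the CR.
This is only an illustration under simplifying assumptions: the RIB is a plain
dict, next-hop resolution is a set lookup, and the prefix count is scaled down.
All function names are hypothetical.

```python
from typing import Dict, Set


def build_cr_rib(num_prefixes: int, next_hops: Set[str]) -> Dict[int, Set[str]]:
    """Each prefix is learned once per next hop (4 paths via ADD-PATH)."""
    return {prefix: set(next_hops) for prefix in range(num_prefixes)}


def usable_paths(rib: Dict[int, Set[str]], reachable: Set[str], prefix: int) -> Set[str]:
    """Paths the CR can forward on: the BGP NH must resolve in the IGP."""
    return rib[prefix] & reachable


def bgp_withdraw(rib: Dict[int, Set[str]], failed_nh: str) -> int:
    """Loopback-as-NH case: the CR must process one withdraw per prefix."""
    processed = 0
    for paths in rib.values():
        if failed_nh in paths:
            paths.remove(failed_nh)
            processed += 1
    return processed


def igp_withdraw(reachable: Set[str], failed_nh: str) -> int:
    """ANH case: one IGP update invalidates every path that uses failed_nh."""
    reachable.discard(failed_nh)
    return 1


if __name__ == "__main__":
    n = 3000  # scaled down from the 300k prefixes in the example
    nhs = {"ANH1", "ANH2", "ANH3", "ANH4"}
    rib, reachable = build_cr_rib(n, nhs), set(nhs)
    print("IGP events:", igp_withdraw(reachable, "ANH1"))
    print("usable for prefix 0:", sorted(usable_paths(rib, reachable, 0)))
    print("BGP withdraws still to process:", bgp_withdraw(rib, "ANH1"))
```

In this model the forwarding decision changes after the single IGP event, while
the per-prefix BGP withdraws only clean up state that is already unused.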

All of this works fine today out of the box if you do not set next hop self on 
your ASBRs (typical cfg in non MPLS networks :).
[RJS] I have the impression that NHS is quite popular in non-MPLS networks as 
well. I have heard these reasons from a few customers:

  *   Less loss if an external interface flaps and a backup exists (a typical 
ASBR has a bunch of peers + 1-2 transit/uplink providers).
  *   The peer address may belong to the peer's IP space. Carrying it inside 
the IGP is perceived as operationally problematic: the network is harder to 
audit/secure. Not that I personally buy it fully, but I see the point.

And if you do set nhp there as mentioned you can control when it is 
redistributed into IGP (or removed from it) by tracking any object (or set of 
objects) you define. Very simple.

[RJS] I do not get this. Do you mean setting the BGP NH to Peer_IP or to a 
loopback?

  *   Peer_IP – There is more than one way to skin the cat. To get this 
address into the IGP you can:

     *   Configure the inter-AS interface under the IGP with the passive 
option. Then, if an “object” is tracked, this configuration needs to be altered 
when the object is unavailable or in a fault state.
     *   Redistribute the interface subnet into the IGP via a policy/route-map. 
Then, if an “object” is tracked, this policy configuration needs to be altered 
when the object is unavailable or in a fault state.
     *   The issue with Peer_IP is that it is inherently associated 1:1 with a 
session. So what if I want to group sessions (see slide 14, where ASBR1.1 has 2 
links/sessions to AS2, and slide 12, which explains that the peer_AS may also 
have a scale-out topology with multiple ASBRs per site) and remove the BGP NH 
address from the IGP only if a given ASBR loses all sessions in the group?

  *   Loopback_IP

     *   Let's say we have multiple IPs on the loopback (or multiple 
loopbacks), where IP1, used for management, is different from IP2, used to 
source iBGP sessions, which is different from IP3-IPn, used as BGP NH. This 
way, we can redistribute IP3-IPn into the IGP conditionally, with tracking of 
some “object”.
     *   This allows 1:1 as well as N:1 association of eBGP sessions to a BGP 
NH.
     *   And this is just another way of implementing ANH. It is exactly the 
same.

  *   The question now is what the tracked “object” is and how we can do this 
locally.

     *   In the draft we track the state and EoR of a set of sessions. So if a 
session is ESTABLISHED but still in initial learning, the ANH is not in the 
IGP. This is to:

        *   Prevent churn when max-prefix is enabled on the session and is 
exceeded, in which case the session will be torn down.
        *   Give the data plane some time to be programmed before making the 
given ASBR available as an exit for the rest of the network.

     *   Another option could be waiting for the first BGP keepalive after the 
session enters the ESTABLISHED state. Yet another would be to monitor some 
“lead” prefix that we know can be learned only over a session from the set 
associated with the given ANH. This is a local implementation matter. In Junos 
we chose one way, which we believe is good, but other implementations can do 
this their own way (e.g. I know IOS-XR monitors the 1st keepalive).
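
The tracking rule from the draft (session state plus EoR, OR-ed over the set of
sessions mapped to one ANH) can be sketched as below. This is a minimal
illustration, not the Junos implementation; the class, its fields and the
function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class EbgpSession:
    neighbor: str
    state: str                  # "Idle", "Established", ...
    eor_received: bool = False  # End-of-RIB marker seen from the peer


def session_ready(session: EbgpSession) -> bool:
    """Established alone is not enough; initial learning must be finished."""
    return session.state == "Established" and session.eor_received


def anh_in_igp(sessions: Iterable[EbgpSession]) -> bool:
    """OR logic: advertise the ANH while at least one session is ready."""
    return any(session_ready(s) for s in sessions)


if __name__ == "__main__":
    group = [EbgpSession("11.1.1.1", "Established", eor_received=False),
             EbgpSession("11.1.1.5", "Idle")]
    print(anh_in_igp(group))  # still in initial learning
    group[0].eor_received = True
    print(anh_in_igp(group))  # EoR received on one session
```

Swapping session_ready for a "first keepalive seen" or "lead prefix present"
check would model the other options without changing the OR logic.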

The inter-site operation – advertising only one path with a BGP NH representing 
a “set of eBGP sessions from a set of ASBRs” – is just one more optimization. 
Let's call this SP_ANH (Site-Peer ANH, in contrast to the ASBR-Peer ANH 
discussed above).

· If the RR advertises only one path to other sites and the BGP NH is the 
loopback of one of the ASBRs (or an ASBR-Peer ANH), then what is the 
convergence in case of this ASBR's failure? The RR has to send 300k paths with 
a new BGP NH. Until this is done, remote sites will send traffic somewhere 
else, not to the best egress site.
Advertise not one but two paths each coming from different ASBRs.
[RJS] This will work as well, to a degree. It obviously requires ADD-PATH among 
sites.

  *   What if I have 8 ASBRs and, by bad luck, 2 of them have lost their 
sessions, namely the 2 the RR chose to re-advertise? I still have 6 ASBRs 
capable of sending traffic out, but the BGP NH has to be removed from the IGP 
for both of them, and the RR has to re-advertise all paths with yet another BGP 
NH. As a result, traffic will be lost or sent to a sub-optimal exit for the 
time BGP needs to converge.
  *   The other issue is that

· If the RR advertises all 4 paths to other sites, each with a BGP NH that is 
the loopback of one of the ASBRs (or an ASBR-Peer ANH), then an IGP update 
removing this address will allow quick restoration (as the other 3 paths are 
available everywhere). But in the multi-path scenario we have just created a 
2-3 level ECMP structure on remote BGP speakers:
prefix --> (list of 4 BGP NH) --> each BGP NH --> list of IGP ECMP neighbours. 
That is costly to manage in S/W and in H/W.

If you advertise ANH per ASBR you will still have 4 different next hops so 
exactly the case as above.

[RJS] No. My RR is doing:

  1.  Toward CRs and ASBRs in the same site: ADD-PATH with the BGP NH unchanged 
(per-ASBR ANH).
  2.  Toward other sites: another BGP NH replacement, with the Site-Peer ANH 
(Area-Wide).
This way other sites see the BGP NH as if it were anycast for all ASBRs, but 
intra-site paths have an ASBR-specific BGP NH. There is a hierarchy of 
abstraction.
When a packet arrives at a CR of the egress site, there are N paths with 
per-ASBR specific NHs.
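
The two export behaviours of the RR can be sketched as follows. This is a
hedged illustration only: the Path type, the rr_export function and the
"SP_ANH" label are hypothetical, and taking the first path as the RR's best
path is a simplifying assumption.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Path:
    prefix: str
    next_hop: str  # per-ASBR ANH as received from the ASBR


def rr_export(paths: List[Path], same_site: bool, site_peer_anh: str) -> List[Path]:
    """Paths the RR sends for one prefix, depending on where the receiver sits."""
    if not paths:
        return []
    if same_site:
        # 1. ADD-PATH toward on-site CRs and ASBRs, BGP NH unchanged.
        return list(paths)
    # 2. One path toward other sites, BGP NH replaced by the Site-Peer ANH.
    # Assumption: paths[0] stands in for the RR's best path.
    return [replace(paths[0], next_hop=site_peer_anh)]


if __name__ == "__main__":
    learned = [Path("203.0.113.0/24", f"ANH{i}") for i in range(1, 5)]
    print([p.next_hop for p in rr_export(learned, True, "SP_ANH")])
    print([p.next_hop for p in rr_export(learned, False, "SP_ANH")])
```

Remote sites therefore never see which ASBR is behind the path, while on-site
routers keep the per-ASBR next hops they need to pick the right exit.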

One may ask: what if MPLS is used and the CR is an LSR only? In this case the 
BGP advertisement and BGP NH manipulation are exactly the same as above. 
However, the inter-site LSP FEC is the Site-Peer ANH, and it must be inserted 
into IGP/LDP from some node that has full routing, let's say from all ASBRs (as 
anycast). This is in addition to the ASBR-Peer ANH insertion into the IGP.
In this case traffic may indeed leave the tunnel on an ASBR whose best route is 
not external (due to the asymmetry you described below and discussed in the 
draft). However, as ASBRs get routes from the other ASBRs of the same site via 
the RR with an ASBR-Peer ANH (ASBR-specific), it will U-turn the traffic to the 
correct ASBR via the intra-site infrastructure. This is suboptimal routing, but 
it does not cause a loop. It is a tradeoff: a full IP lookup on the CR (or 
possibly on a “spine”/aggregation device sitting between ASBR and CR), or 
intra-site suboptimal routing where only the ASBRs need a full IP FIB.

I do not pretend that this solution is the holy grail of networking or that it 
solves the world hunger problem. It is just a tool/solution that may help, and 
the network owner is to judge whether it serves him well.

If you are, however, proposing to advertise all paths from all ASBRs with the 
same next hop (in your case AS-WIDE ANH) - brilliant - but how are you assuring 
that all EBGP peers send you symmetric routes? If you get some EBGP sessions 
giving you partial BGP reachability for whatever reason, and you are still 
using anycast ANH from all 4 ASBRs, the packets which IGP sends towards said 
ASBRs would be either dropped or looped between ASBRs till TTL expires, as the 
next hop is still ANH, so anycast.

· If the RR advertises only one path to other sites and the BGP NH is the 
Site-Peer ANH, as in this proposal, then the Site-Peer ANH is not removed from 
the IGP (as other ASBRs still have sessions with the peer). Remote sites keep 
sending traffic using pre-failure data until the BGP update from the RR comes. 
And when it comes, it will have the same BGP NH as the pre-failure path, so 
there is no need to update the FIB. The FIB structure is also simpler and less 
costly:
prefix --> one BGP NH --> list of IGP ECMP neighbours.
Some merchant chips have really limited ECMP capability…
See above.
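
To make the two FIB shapes concrete, here is a small sketch that expands
prefix --> BGP NH --> IGP ECMP neighbours into leaf entries for both designs.
It is a toy model under stated assumptions: two IGP neighbours per next hop,
hypothetical names core1/core2, and a hypothetical "SP_ANH" label.

```python
from typing import Dict, List, Tuple


def resolve(prefix: str,
            bgp_nh: Dict[str, List[str]],
            igp_ecmp: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Expand prefix --> BGP NH list --> IGP ECMP neighbours into leaf entries."""
    return [(nh, nbr) for nh in bgp_nh[prefix] for nbr in igp_ecmp.get(nh, [])]


if __name__ == "__main__":
    pfx = "203.0.113.0/24"
    cores = ["core1", "core2"]  # assumed IGP ECMP neighbours

    # All 4 paths advertised: two ECMP levels on the remote speaker.
    multi_nh = {pfx: ["ANH1", "ANH2", "ANH3", "ANH4"]}
    multi_igp = {nh: list(cores) for nh in multi_nh[pfx]}

    # One path with Site-Peer ANH: a single BGP NH over IGP ECMP.
    single_nh = {pfx: ["SP_ANH"]}
    single_igp = {"SP_ANH": list(cores)}

    print(len(resolve(pfx, multi_nh, multi_igp)), len(resolve(pfx, single_nh, single_igp)))

    # ASBR1 loses its sessions: ANH1 leaves the IGP, SP_ANH stays.
    del multi_igp["ANH1"]
    print(len(resolve(pfx, multi_nh, multi_igp)), len(resolve(pfx, single_nh, single_igp)))
```

In the single-path case nothing in the remote structure changes when one ASBR
fails, which matches the point that no FIB update is needed there.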

See when we worked on concept of virtual BGP paths we did a lot of analysis of 
this and reached the conclusion that while possible the application of given 
abstract or virtual next hop to consistent union of  prefixes reachable over N 
EBGP sessions from single or multiple ASBRs must be automated as you can not 
assure eBGP symmetry.

So even in your simplest case of ANH per ASBR per PeerAS there is zero 
guarantee that you get identical paths from all peers. That means that, since 
you are going to keep the ANH in IGP till the last session goes away, you are 
going to attract traffic to such ASBRs until BGP has withdrawn all affected 
non-symmetrical paths. Trust me, much faster would be not to set nh on ASBR and 
just remove peer's address in IGP in one shot. Then BGP can take however long 
it takes to "converge" without affecting any data plane.

[RJS] If the above is what the operator wants, they can simply assign an ANH to 
each eBGP session on a 1:1 basis. We are not discussing now how to do this in 
the CLI of a given implementation. Some could be simple, some more complex. I 
am just saying that an ANH is assigned to a set of eBGP sessions, and nothing 
prevents the operator from making the set size == 1.
Indeed, in the special case of 1:1 assignment, using the peer's interface IP 
(with a /32 mask) is the logical choice. But this is up to the AS operator. 
Some big cloud guys like to assign their own IP even in this case, as they do 
not want IPs from another AS's address space in their IGP.

Many thx,
R.

THX R.

R. 😊

_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow
