Opsawg,

Section 3.2 of 
https://datatracker.ietf.org/doc/draft-ietf-rtgwg-net2cloud-problem-statement/  
describes the impact of Cloud site failures on traffic to/from enterprises' 
workloads hosted in Cloud DCs.

We would really appreciate your feedback on this description.

----------
3.2. Site failures and Methods to Minimize Impacts

Site failures include, but are not limited to, site capacity degradation or an 
entire site going down. Causes vary: a fiber cut on links connecting to the 
site or among pods within the site, cooling failures, insufficient backup 
power, cyber attacks, too many changes outside of the maintenance window, etc. 
Fiber cuts are not uncommon within a Cloud site or between sites.
As described in RFC7938, a Cloud DC that uses BGP as its routing protocol 
might not have an IGP to route around link/node failures within the ASes.
When those failure events happen, the Cloud DC GW, which is visible to 
clients, may still be running fine. Because the BFD session between the Client 
GW and the Cloud DC GW stays up, the Client GW can't use BFD to detect the 
failures.
When a site's capacity degrades or the site goes dark, a massive number of 
routes need to be changed.
Switching a large number of routes over to another site can also cause an 
overload that triggers further failures.
In addition, the routes (IP addresses) in a Cloud DC cannot be aggregated 
nicely, which triggers a very large number of BGP UPDATE messages when a 
failure occurs.
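To illustrate the aggregation point, here is a minimal Python sketch using the 
standard library's ipaddress module. The prefixes are made-up examples, not 
taken from the draft: four contiguous /24s collapse into one covering prefix, 
while four scattered /24s stay as four separate routes, each needing its own 
announcement or withdrawal.

```python
import ipaddress

def aggregate(prefixes):
    """Collapse a list of prefix strings into the fewest covering networks."""
    nets = [ipaddress.ip_network(p) for p in prefixes]
    return list(ipaddress.collapse_addresses(nets))

# Hypothetical example prefixes (assumptions for illustration only).
contiguous = ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
scattered = ["10.0.0.0/24", "10.0.5.0/24", "10.1.9.0/24", "172.16.33.0/24"]

if __name__ == "__main__":
    print("contiguous:", [str(n) for n in aggregate(contiguous)])
    print("scattered: ", [str(n) for n in aggregate(scattered)])
```

With the scattered set, a site failure affects as many routes as there are 
prefixes, which is the situation the paragraph above describes.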
It might be more effective to do a mass reroute, similar to the mass withdraw 
mechanism defined for EVPN [RFC7432], to signal to remote PE nodes as quickly 
as possible that a large number of routes have changed.
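The difference between per-prefix withdrawal and a mass reroute can be 
sketched with a toy model. This is only an illustration under stated 
assumptions, not the mechanism of the draft or of RFC7432: the RemotePE class, 
the site identifiers "site-A"/"site-B", and the helper functions are all 
hypothetical, and the per-prefix count is an upper bound because a real BGP 
UPDATE can carry several withdrawn prefixes.

```python
from dataclasses import dataclass, field

@dataclass
class RemotePE:
    """Toy remote PE: each prefix maps to candidate sites in preference order."""
    rib: dict = field(default_factory=dict)
    failed_sites: set = field(default_factory=set)

    def learn(self, prefix, site_id):
        self.rib.setdefault(prefix, []).append(site_id)

    def withdraw_prefix(self, prefix, site_id):
        # Per-prefix signalling: one withdrawal touches one route.
        self.rib[prefix].remove(site_id)

    def mass_withdraw(self, site_id):
        # Mass signalling: one notification invalidates every route via the site.
        self.failed_sites.add(site_id)

    def best_site(self, prefix):
        for site in self.rib.get(prefix, []):
            if site not in self.failed_sites:
                return site
        return None

def build_pe(n):
    """Hypothetical setup: n prefixes reachable via site-A, backed up by site-B."""
    pe = RemotePE()
    for i in range(n):
        prefix = f"10.{i // 256}.{i % 256}.0/24"
        pe.learn(prefix, "site-A")
        pe.learn(prefix, "site-B")
    return pe

def signal_per_prefix(pe, site_id):
    """Withdraw every prefix individually; return the number of withdrawals."""
    prefixes = list(pe.rib)
    for p in prefixes:
        pe.withdraw_prefix(p, site_id)
    return len(prefixes)

def signal_mass(pe, site_id):
    """Signal the site failure once; return the number of notifications."""
    pe.mass_withdraw(site_id)
    return 1

if __name__ == "__main__":
    print("per-prefix withdrawals:", signal_per_prefix(build_pe(1000), "site-A"))
    print("mass notifications:    ", signal_mass(build_pe(1000), "site-A"))
```

Both approaches leave every prefix pointing at the backup site; the mass 
variant gets there with a single notification whose cost does not grow with 
the number of prefixes.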
-------------------------------------
Thank you very much
Linda Dunbar

_______________________________________________
rtgwg mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/rtgwg
