Joel,

I see your points. Please see my explanation below, marked with <ld> </ld>.



From: Joel Halpern <[email protected]>
Sent: Monday, August 21, 2023 11:34 PM
To: Linda Dunbar <[email protected]>
Cc: rtgwg-chairs <[email protected]>; 
[email protected]; [email protected]
Subject: Re: Need your help to make sure the 
draft-ietf-rtgwg-net2cloud-problem-statement readability is good.


Thank you Linda.  Trimmed the agreements, including acceptable text from your 
reply.  Leaving the two points that can benefit from a little more tuning.
Marked <jmh2></jmh2>
Yours,
Joel
On 8/22/2023 12:12 AM, Linda Dunbar wrote:


Similarly, section 3.2 looks like it could apply to any operator.  The 
reference to the presence or absence of IGPs seems largely irrelevant to the 
question of how partial failures of a facility are detected and dealt with.
[Linda] Two reasons why the site failure described in Section 3.2 does not 
apply to other networks:
1.      One DC can have many server racks concentrated in a small area, all of 
which can fail in a single event. By contrast, a regular network failure at one 
location only impacts the routers at that location, which quickly triggers the 
services to switch to the protection paths.
2.      Regular networks run an IGP, which can quickly propagate internal 
fiber-cut failures to the edge, whereas many DCs don’t run an IGP.
<jmh>Given that even a data center has to deal with internal failures, and that 
even traditional ISPs have to deal with partitioning failures, I don't think 
the distinction you are drawing in this section really exists.  If it does, you 
need to provide stronger justification.  Also, not all public DCs have chosen 
to use just BGP, although I grant that many have. I don't think you want to 
argue that the folks who have chosen to use BGP are wrong.  </jmh>

<ld> Are you referring to Network-Partitioning Failures in Cloud Systems?
Traditional ISPs don’t host end services; they are responsible for transporting 
packets, so a protection path can reroute the packets. But a Cloud DC site/PoD 
failure makes all of its hosts (prefixes) unreachable. </ld>
<jmh2> If a DC site fails, the services fail too.  Yes, the DC operator has 
to reinstantiate them.  But that is way outside our scope.  To the degree that 
they can recover by rerouting to other instances (whether using anycast or some 
other trick) it looks just like routing around failures in other cases, which 
BGP and IGPs can do.  I am still not seeing how this justifies any special 
mechanisms. </jmh2>
<ld>
You are correct that the protection is the same as in regular ISP networks.

The paragraph is intended to say the following:
      When a site failure occurs, many instances can be impacted. When the 
impacted instances’ IP prefixes in a Cloud DC are not well aggregated, which 
is very common, a single site failure can trigger a huge number of BGP UPDATE 
messages. Instead of sending many BGP UPDATE messages to the ingress routers 
to cover all the impacted instances, [METADATA-PATH] proposes a single BGP 
UPDATE indicating the site failure. The ingress routers can then switch over 
all the instances that are associated with the site.
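
To make the contrast concrete, here is a rough sketch of what an ingress router would have to process in each case. It is illustrative only: the prefixes, site names, and the IngressRouter class are all hypothetical, and it does not model the BGP wire format or the actual encoding proposed in [METADATA-PATH].

```python
# Hypothetical mapping of instance prefixes to the site that hosts them.
# The site-A prefixes are deliberately scattered so they cannot be
# summarized into one covering prefix.
PREFIX_TO_SITE = {
    "10.1.7.0/24": "site-A",
    "172.16.40.0/24": "site-A",
    "192.0.2.0/24": "site-A",
    "198.51.100.0/24": "site-B",
}


def per_prefix_withdrawals(prefix_to_site, failed_site):
    """Baseline: every impacted prefix has to be withdrawn individually."""
    return sorted(p for p, s in prefix_to_site.items() if s == failed_site)


class IngressRouter:
    """Hypothetical ingress router that tracks which site each prefix is in."""

    def __init__(self, prefix_to_site):
        self.site_of = dict(prefix_to_site)
        self.failed_sites = set()

    def on_site_failure(self, site):
        # A single site-level notification marks the whole site as down.
        self.failed_sites.add(site)

    def usable(self, prefix):
        # A prefix is usable only while its hosting site is not marked failed.
        return self.site_of[prefix] not in self.failed_sites


withdrawn = per_prefix_withdrawals(PREFIX_TO_SITE, "site-A")
print(len(withdrawn), "per-prefix route changes without site-level signaling")

router = IngressRouter(PREFIX_TO_SITE)
router.on_site_failure("site-A")  # one notification covers every site-A prefix
print([p for p in PREFIX_TO_SITE if router.usable(p)])
```

In this sketch the per-prefix work grows with the number of impacted prefixes, while the site-level notification is a single event regardless of how many prefixes the site hosts.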

</ld>



_______________________________________________
rtgwg mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/rtgwg