Jeff,

Sure thing - some of the reasoning is provided in the document; let me elaborate on the specific questions you have asked.
On the eBGP vs iBGP topic: there have been some implementation constraints we bumped into when we considered iBGP. For example, if route reflection is employed, then using ECMP for anycast prefixes (typically announced by load balancers) could get you in trouble, since none of the implementations we had on hand considered cluster list length when picking ECMP-equivalent paths. This could be worked around by applying a policy to enforce tie-breaking (e.g. prefer downstream routes over upstream), but that means a bit more complexity in the configuration. There have been a few other issues (e.g. iBGP might try to use the default route to establish peering even if the directly connected subnet is not reachable). That being said, I'm aware of iBGP-based deployments, and there has been work to document those. For the most part, routing mechanics and convergence are the same in both designs.

It is possible to use a 'hybrid' approach, e.g. use iBGP between Tier-2 and Tier-3 switches, with each 'cluster' wrapped in its own ASN. There are deployments that use this approach as well, and it works just fine - same routing properties. With the design described in the document, we eventually turned the hybrid option down as it looked less uniform in terms of BGP features used. A potential benefit of an eBGP-*only* design is the ability to come up with a more lightweight BGP implementation that only has the concept of eBGP sessions (less code). It could be a matter of operational preference, but having eBGP everywhere also makes it easy to trace a particular prefix to its originating Tier-3 switch (though one can also use communities for that).

Route summarization is probably the most interesting aspect. We actively employ route summarization at FB, and the document has some discussion around this topic. To put it short, with route summarization you need one of the following:

1) Create a bypass path between Tier-2 switches (or any tier doing aggregation) - e.g. a ring interconnecting them together. This adds some concerns around managing the ring capacity and wasting ports on Tier-2 switches. You can see a sample topology that uses this design here (older FB design): http://nathanfarrington.com/papers/facebook-oic13.pdf

2) Use simple virtual aggregation when doing summarization: leak, but do not install, the specifics - i.e. only install specifics when some of the paths fail. This is mentioned in the document, but to my knowledge there is currently no widespread implementation for DC-class network devices.

3) Use intelligent dis-aggregation on Tier-3 switches - once a switch detects that one of its uplinks is dead, it starts announcing its prefixes with a BGP community that excludes them from summarization on upstream devices. Technically it is similar to (2), but relies purely on the control plane.

4) As opposed to using the bypass ring between Tier-2 devices, use paths via Tier-3 switches as alternates: this means allowing some of the Tier-3 switches to re-advertise routes back to Tier-2. The problem here is significantly increased routing and policy complexity.

Route summarization adds the benefits of a reduced fault domain scope (which can be very handy in preventing certain positive feedback loops) and a reduced FIB footprint. The latter is especially important with IPv6 - FB is dual-stack everywhere, and the majority of internal traffic is IPv6, see http://www.swissipv6council.ch/sites/default/files/docs/day_2_7_paul_saab.pdf.

MPLS is an interesting topic, mostly because it allows for a simple and uniform data plane and permits seamless handover between DC and backbone. There are also possible use cases with virtual aggregation. However, the complexity tradeoffs need better understanding (e.g. if going with encapsulation on the server). We currently have no experience using it inside the data center, but that's an actively discussed topic.
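To make the iBGP route-reflection concern above more concrete, here is a toy Python model of it. This is not any vendor's actual best-path algorithm - the attribute set, names and values are simplified assumptions for illustration - but it shows how multipath grouping that ignores cluster list length treats a near and a far anycast instance as equal, while the tie-break policy mentioned (prefer the "closer" path) keeps only one:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Path:
    nexthop: str
    local_pref: int = 100   # higher wins
    as_path_len: int = 0    # shorter wins
    med: int = 0            # lower wins
    cluster_list: List[str] = field(default_factory=list)  # one entry per RR hop

def ecmp_set(paths: List[Path]) -> List[Path]:
    """Multipath grouping as many implementations do it: compare
    local-pref, AS-path length and MED, ignoring cluster list length."""
    key = lambda p: (p.local_pref, -p.as_path_len, -p.med)
    best = max(key(p) for p in paths)
    return [p for p in paths if key(p) == best]

def ecmp_set_with_tiebreak(paths: List[Path]) -> List[Path]:
    """Same grouping, plus a policy-style tie-break: keep only paths
    with the shortest cluster list (i.e. the 'closest' instance)."""
    cands = ecmp_set(paths)
    shortest = min(len(p.cluster_list) for p in cands)
    return [p for p in cands if len(p.cluster_list) == shortest]

# An anycast prefix learned via route reflection from a near and a far instance:
near = Path("10.0.0.1", cluster_list=["rr1"])
far = Path("10.0.1.1", cluster_list=["rr1", "rr2"])

print(len(ecmp_set([near, far])))                # 2 - both treated as ECMP-equal
print(len(ecmp_set_with_tiebreak([near, far])))  # 1 - policy keeps the near path
```

Without the tie-break, half the anycast traffic crosses an extra reflection hop to the far instance; the policy restores locality at the cost of extra configuration, which is the complexity tradeoff described above.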
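The control-plane dis-aggregation in option (3) can also be sketched in a few lines of Python. The community value and function names here are invented for the example (they come from neither the draft nor any vendor implementation); the point is only the mechanism - a Tier-3 switch tags its prefixes when an uplink dies, and the aggregating tier lets tagged specifics escape the summary:

```python
# Assumed community meaning "exclude this prefix from summarization".
NO_SUMMARIZE = "65535:666"

def tier3_announcements(prefixes, uplinks):
    """A Tier-3 switch announces its prefixes as usual; once any uplink
    is down, it tags them so upstream devices stop hiding them in the summary."""
    all_up = all(uplinks.values())
    community = [] if all_up else [NO_SUMMARIZE]
    return [(p, community) for p in prefixes]

def tier2_advertisements(summary, announcements):
    """A Tier-2 switch advertises the summary plus any specific that
    carries the no-summarize community."""
    specifics = [p for p, comm in announcements if NO_SUMMARIZE in comm]
    return [summary] + specifics

# Healthy state: only the summary leaves Tier-2.
anns = tier3_announcements(["192.0.2.0/26"], {"t2-a": True, "t2-b": True})
print(tier2_advertisements("192.0.2.0/24", anns))  # ['192.0.2.0/24']

# One uplink dead: the specific escapes the summary, steering traffic
# around the broken link purely via the control plane.
anns = tier3_announcements(["192.0.2.0/26"], {"t2-a": True, "t2-b": False})
print(tier2_advertisements("192.0.2.0/24", anns))  # ['192.0.2.0/24', '192.0.2.0/26']
```

Unlike option (2), nothing is pre-installed and later activated; the specifics simply do not exist upstream until a failure makes them necessary, which keeps the FIB small in the common case.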
I believe the major contribution of the document is debunking some fears associated with BGP (path hunting, MRAI etc.) and demonstrating its feasibility as a routing protocol for hierarchical, dense, up-down routed networks. The proposed design avoids the issues with path hunting and demonstrates convergence times well under one second (for a single failure; full-table loads may take longer due to FIB update times) - mostly bounded by event propagation time + FIB update time. In the case of ECMP, the failover is done in hardware, and with a local bypass available (e.g. a ring interconnecting Tier-2 switches) the repair can happen locally both for upstream and downstream traffic.

Best regards,
Petr

____
From: Jeff Tantsura [[email protected]]
Sent: Tuesday, July 29, 2014 8:55 AM
To: Petr Lapukhov; Alia Atlas; Antoni Przygienda
Cc: [email protected]; [email protected]
Subject: Re: WG review for draft-lapukhov-bgp-routing-large-dc

Hi Petr,

While creating the blueprint you must have considered different options: eBGP vs iBGP, mixed, summarization (in different places), MPLS over foo, etc. It would be great if you could elaborate on why the design looks the way it does.

Thanks!

Cheers,
Jeff

From: Petr Lapukhov <[email protected]>
Date: Tuesday, July 29, 2014 at 8:43 AM
To: Alia Atlas <[email protected]>, Antoni Przygienda <[email protected]>
Cc: "[email protected]" <[email protected]>, "[email protected]" <[email protected]>
Subject: RE: WG review for draft-lapukhov-bgp-routing-large-dc

Hi Tony,

I do agree that the document is rather operational/blueprint in its nature. However, I think it's beneficial to have a practical example and start a discussion around how large-scale routing in "dense" networks could be accomplished using existing routing protocols.
This could be used as factual ground in evaluating various new routing protocol designs for similar environments. As we progress, I hope to add more practical data (e.g. convergence times). There are some interesting theoretical aspects (e.g. using route summarization with simple virtual aggregation) that haven't been explored yet and could be an interesting discussion topic as well.

Thank you!
Petr

________________________________
From: rtgwg [[email protected]] on behalf of Alia Atlas [[email protected]]
Sent: Tuesday, July 29, 2014 8:23 AM
To: Antoni Przygienda
Cc: [email protected]; [email protected]
Subject: Re: WG review for draft-lapukhov-bgp-routing-large-dc

Hi Tony,

On Tue, Jul 29, 2014 at 10:45 AM, Antoni Przygienda <[email protected]> wrote:
> hi antoni,
>
> i am all for accepting it as a WG item - IMO it's an excellent proof point that
> really large datacenters can be run based on existing protocols and

[Tony said] Hannes, yepp, first, the work's interesting & the fact that it works gives it its own merit; it should possibly be taken up by someone, IMO. My points were though:

1. I didn't see 'running large datacenters' as something on the RTGWG charter, and it's also typically something that is first driven by a larger set of requirements rather than a single data point.

As I said earlier, this falls into the other work and handling of individual drafts without a home. It's a way of getting the review and consensus for drafts that would otherwise be AD-sponsored. I agree that this is a change from how rtgwg has been used in the past.

2. A blueprint of a particular solution is exactly that. It is not a generic protocol specification or guideline that will fit everyone.
If you have multi-TS which bring their own existing addresses, or need MAC mobility, or have to run L2 applications, or don't have a BGP implementation with the necessary twists, or other tidbits which tons of DCs happen to care about, then the shoe may not fit.

Absolutely - this is a starting point that gives one idea.

> even getting to 10000s of routing nodes is not the end of the world.

[Tony said] We know that from a running thing called the 'Internet' ;-P I know I took it out of context but I couldn't resist the tongue-in-cheek pun possible ;-)

Again, looking fwd' to presentation and discussion on the floor.

Please - start the discussion here and now! I was delighted to see how many people were clearly familiar with the work and thought it was a good idea to discuss. Let's get some good reviews and suggestions so the draft can be much better before the next IETF.

Alia

---
tony

_______________________________________________
rtgwg mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/rtgwg
