Jeff,

Sure thing - some of the reasoning is provided in the document; let me elaborate on the specific questions you have asked.
On the eBGP vs iBGP topic: there have been some implementation constraints we bumped into when we considered iBGP. For example, if route reflection is employed, then using ECMP for anycast prefixes (typically announced by load balancers) could get you in trouble, since none of the implementations we had on hand considered cluster list length when picking ECMP-equivalent paths. This could be worked around by applying a policy to enforce tie-breaking (e.g. prefer downstream routes over upstream), but that means a bit more complexity in the configuration. There have been a few other issues (e.g. iBGP might try to use the default route to establish peering even if the directly connected subnet is not reachable). That being said, I'm aware of iBGP-based deployments, and there has been work to document those. For the most part, routing mechanics and convergence are the same in both designs.

It is possible to use a 'hybrid' approach, e.g. use iBGP between Tier-2 and Tier-3 switches, with each 'cluster' wrapped in its own ASN. There are deployments that use this approach as well, and it works just fine - same routing properties. With the design described in the document, we eventually turned the hybrid option down as it looked less uniform in terms of BGP features used. A potential benefit of an eBGP-*only* design is the ability to come up with a more lightweight BGP implementation that only has the concept of eBGP sessions (less code). It could be a matter of operational preference, but having eBGP everywhere also makes it easy to trace a particular prefix to its originating Tier-3 switch (though one can also use communities for that).

Route summarization is probably the most interesting aspect. We actively employ route summarization at FB, and the document has some discussion around this topic. To put it short, with route summarization you need one of the following:

1) Create a bypass path between Tier-2 switches (or any tier doing aggregation) - e.g. a ring interconnecting them together. This adds some concerns around managing the ring capacity and wasting ports on Tier-2 switches. You can see a sample topology that uses this design here (older FB design): http://nathanfarrington.com/papers/facebook-oic13.pdf

2) Use simple virtual aggregation when doing summarization: leak, but do not install, the specifics - i.e. only install specifics when some of the paths fail. This is mentioned in the document, but to my knowledge there is currently no widespread implementation for DC-class network devices.

3) Use intelligent dis-aggregation on Tier-3 switches - once a switch detects that one of its uplinks is dead, it starts announcing its prefixes with a BGP community that excludes them from summarization on upstream devices. Technically it is similar to (2), but relies purely on the control plane.

4) As opposed to using the bypass ring between Tier-2 devices, use paths via Tier-3 switches as alternates: this means allowing some of the Tier-3 switches to re-advertise routes back to Tier-2. The problem here is significantly increased routing and policy complexity.

Route summarization adds the benefits of a reduced fault domain scope (which can be very handy in preventing certain positive feedback loops) and a reduced FIB footprint. The latter is especially important with IPv6 - FB is dual-stack everywhere, and the majority of internal traffic is IPv6, see http://www.swissipv6council.ch/sites/default/files/docs/day_2_7_paul_saab.pdf.

MPLS is an interesting topic, mostly because it allows for a simple and uniform data plane and permits seamless handover between DC and backbone. There are also possible use cases with virtual aggregation. However, the complexity tradeoffs need better understanding (e.g. if going with encapsulation on the server). We currently have no experience using it inside the data center, but that's an actively discussed topic.
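To make the iBGP route-reflection concern above more concrete, here is a toy Python model of it. This is not any vendor's actual best-path algorithm - the attribute set, names and values are simplified assumptions for illustration - but it shows how multipath grouping that ignores cluster list length treats a near and a far anycast instance as equal, while the tie-break policy mentioned (prefer the "closer" path) keeps only one:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Path:
    nexthop: str
    local_pref: int = 100   # higher wins
    as_path_len: int = 0    # shorter wins
    med: int = 0            # lower wins
    cluster_list: List[str] = field(default_factory=list)  # one entry per RR hop

def ecmp_set(paths: List[Path]) -> List[Path]:
    """Multipath grouping as many implementations do it: compare
    local-pref, AS-path length and MED, ignoring cluster list length."""
    key = lambda p: (p.local_pref, -p.as_path_len, -p.med)
    best = max(key(p) for p in paths)
    return [p for p in paths if key(p) == best]

def ecmp_set_with_tiebreak(paths: List[Path]) -> List[Path]:
    """Same grouping, plus a policy-style tie-break: keep only paths
    with the shortest cluster list (i.e. the 'closest' instance)."""
    cands = ecmp_set(paths)
    shortest = min(len(p.cluster_list) for p in cands)
    return [p for p in cands if len(p.cluster_list) == shortest]

# An anycast prefix learned via route reflection from a near and a far instance:
near = Path("10.0.0.1", cluster_list=["rr1"])
far = Path("10.0.1.1", cluster_list=["rr1", "rr2"])

print(len(ecmp_set([near, far])))                # 2 - both treated as ECMP-equal
print(len(ecmp_set_with_tiebreak([near, far])))  # 1 - policy keeps the near path
```

Without the tie-break, half the anycast traffic crosses an extra reflection hop to the far instance; the policy restores locality at the cost of extra configuration, which is the complexity tradeoff described above.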
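The control-plane dis-aggregation in option (3) can also be sketched in a few lines of Python. The community value and function names here are invented for the example (they come from neither the draft nor any vendor implementation); the point is only the mechanism - a Tier-3 switch tags its prefixes when an uplink dies, and the aggregating tier lets tagged specifics escape the summary:

```python
# Assumed community meaning "exclude this prefix from summarization".
NO_SUMMARIZE = "65535:666"

def tier3_announcements(prefixes, uplinks):
    """A Tier-3 switch announces its prefixes as usual; once any uplink
    is down, it tags them so upstream devices stop hiding them in the summary."""
    all_up = all(uplinks.values())
    community = [] if all_up else [NO_SUMMARIZE]
    return [(p, community) for p in prefixes]

def tier2_advertisements(summary, announcements):
    """A Tier-2 switch advertises the summary plus any specific that
    carries the no-summarize community."""
    specifics = [p for p, comm in announcements if NO_SUMMARIZE in comm]
    return [summary] + specifics

# Healthy state: only the summary leaves Tier-2.
anns = tier3_announcements(["192.0.2.0/26"], {"t2-a": True, "t2-b": True})
print(tier2_advertisements("192.0.2.0/24", anns))  # ['192.0.2.0/24']

# One uplink dead: the specific escapes the summary, steering traffic
# around the broken link purely via the control plane.
anns = tier3_announcements(["192.0.2.0/26"], {"t2-a": True, "t2-b": False})
print(tier2_advertisements("192.0.2.0/24", anns))  # ['192.0.2.0/24', '192.0.2.0/26']
```

Unlike option (2), nothing is pre-installed and later activated; the specifics simply do not exist upstream until a failure makes them necessary, which keeps the FIB small in the common case.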
I believe the major contribution of the document is debunking some fears associated with BGP (path hunting, MRAI etc.) and demonstrating its feasibility as a routing protocol for hierarchical, dense, up-down routed networks. The proposed design avoids the issues with path hunting and demonstrates convergence times well under one second (for a single failure; full-table loads may take longer due to FIB update times) - mostly bounded by event propagation time + FIB update time. In the case of ECMP, the failover is done in hardware, and with a local bypass available (e.g. a ring interconnecting Tier-2 switches) the repair can happen locally both for upstream and downstream traffic.

Best regards,
Petr

____
From: Jeff Tantsura [[email protected]]
Sent: Tuesday, July 29, 2014 8:55 AM
To: Petr Lapukhov; Alia Atlas; Antoni Przygienda
Cc: [email protected]; [email protected]
Subject: Re: WG review for draft-lapukhov-bgp-routing-large-dc

Hi Petr,

While creating the blueprint you must have considered different options: eBGP vs iBGP, mixed, summarization (in different places), MPLS over foo, etc. It would be great if you could elaborate on why the design looks the way it does.

Thanks!

Cheers,
Jeff

From: Petr Lapukhov <[email protected]>
Date: Tuesday, July 29, 2014 at 8:43 AM
To: Alia Atlas <[email protected]>, Antoni Przygienda <[email protected]>
Cc: "[email protected]" <[email protected]>, "[email protected]" <[email protected]>
Subject: RE: WG review for draft-lapukhov-bgp-routing-large-dc

Hi Tony,

I do agree that the document is rather operational/blueprint in its nature. However, I think it's beneficial to have a practical example and start a discussion around how large-scale routing in "dense" networks could be accomplished using existing routing protocols.
This could be used as factual ground in evaluating various new routing protocol designs for similar environments. As we progress, I hope to add more practical data (e.g. convergence times). There are some interesting theoretical aspects (e.g. using route summarization with simple virtual aggregation) that haven't been explored yet and could be an interesting discussion topic as well.

Thank you!
Petr

________________________________
From: rtgwg [[email protected]] on behalf of Alia Atlas [[email protected]]
Sent: Tuesday, July 29, 2014 8:23 AM
To: Antoni Przygienda
Cc: [email protected]; [email protected]
Subject: Re: WG review for draft-lapukhov-bgp-routing-large-dc

Hi Tony,

On Tue, Jul 29, 2014 at 10:45 AM, Antoni Przygienda <[email protected]> wrote:
> hi antoni,
>
> i am all for accepting it as a WG item - IMO it's an excellent proof point that
> really large datacenters can be run based on existing protocols and

[Tony said] Hannes, yepp, first, the work's interesting & the fact that it works gives it its own merit; it should possibly be taken up by someone, IMO. My points were though:

1. I didn't see 'running large datacenters' as something on the RTGWG charter, and it's also typically something that is first driven by a larger set of requirements rather than a single data point.

As I said earlier, this falls into the other work and handling of individual drafts without a home. It's a way of getting the review and consensus for drafts that would otherwise be AD-sponsored. I agree that this is a change from how rtgwg has been used in the past.

2. A blueprint of a particular solution is exactly that. It is not a generic protocol specification or guideline that will fit everyone.
If you have multi-TS which bring their own existing addresses, or need MAC mobility, or have to run L2 applications, or don't have a BGP implementation with the necessary twists, or other tidbits which tons of DCs happen to care about, then the shoe may not fit.

Absolutely - this is a starting point that gives one idea.

> even getting to 10000s of routing nodes is not the end of the world.

[Tony said] We know that from a running thing called the 'Internet' ;-P I know I took it out of context but I couldn't resist the tongue-in-cheek pun possible ;-)

Again, looking fwd' to presentation and discussion on the floor.

Please - start the discussion here and now! I was delighted to see how many people were clearly familiar with the work and thought it was a good idea to discuss. Let's get some good reviews and suggestions so the draft can be much better before the next IETF.

Alia

---
tony

_______________________________________________
rtgwg mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/rtgwg
