Hi Nick,

Thanks for your feedback! To your questions:

1) In the proposed design, the network topology is symmetric and homogeneous, 
and the default route is simply relayed from the "WAN-facing" border routers. 
It is used only to steer traffic to destinations outside of the data center, 
and the issues of "WAN default routing" are outside the scope of the document. 
Keep in mind that the default route is supplied in addition to the full 
routing information for all in-data-center destinations. FIB size is 
generally not an issue in DCs built with modern merchant silicon switches, even 
for data centers as large as 200K bare-metal servers. However, if needed, simple 
virtual aggregation could take care of it, thanks to the shared next-hop sets 
among large groups of prefixes.
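To make the shared-next-hop point concrete, here is an illustrative sketch (all names and prefixes below are hypothetical, not taken from the draft): grouping FIB entries by their next-hop set shows why state can scale with the number of distinct next-hop sets rather than the number of prefixes.

```python
from collections import defaultdict

def group_by_next_hop_set(fib):
    """Naive virtual-aggregation sketch: prefixes sharing the same
    next-hop set are bucketed together, so hardware state can scale
    with the number of distinct next-hop sets, not prefixes."""
    groups = defaultdict(list)
    for prefix, next_hops in fib.items():
        groups[frozenset(next_hops)].append(prefix)
    return groups

# Hypothetical FIB: many server prefixes, few distinct next-hop sets.
fib = {
    "10.0.0.0/26":  ["spine1", "spine2"],
    "10.0.0.64/26": ["spine1", "spine2"],
    "10.0.1.0/26":  ["spine3", "spine4"],
    "0.0.0.0/0":    ["border1", "border2"],
}
groups = group_by_next_hop_set(fib)
print(len(fib), "prefixes ->", len(groups), "next-hop sets")
```

In a real Clos fabric the ratio is far more dramatic: thousands of rack prefixes typically resolve to a handful of ECMP next-hop sets.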

2) Server virtualization is outside the scope of the document. However, if 
required, overlay techniques could be used to isolate tenant IP 
addressing/signaling from the fabric. Hypervisors may participate in an overlay 
control plane, but we avoided this topic on purpose, since it opens a much 
broader discussion. Let's put it this way - the proposed design is for the 
"underlay" bare-metal network :)

3) The main idea for failure detection was to rely on the p2p nature of the 
interconnection links and leverage the optical layer to detect faults. BFD 
could be an option, though many merchant silicon solutions do not support 
hardware generation of BFD packets; in that case BFD is not much different 
from other control-plane keep-alives, since we won't be sharing it across 
multiple upstream protocols. LACP and other technologies could be used to add 
an extra layer of health probing, though this does not change the overall 
fault-detection logic... 
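As a rough illustration of why hardware-assisted BFD matters here: for any hello-style mechanism, worst-case detection time is simply the transmit interval times the detect multiplier. The values below are common defaults, not figures from the draft.

```python
def detection_time_ms(tx_interval_ms, detect_multiplier):
    """Worst-case detection time for a hello/keep-alive protocol:
    the peer is declared down after `detect_multiplier` consecutive
    intervals pass with no packet received."""
    return tx_interval_ms * detect_multiplier

# Software-generated BFD at a conservative 300 ms x 3:
print(detection_time_ms(300, 3), "ms")
# Default-ish eBGP keepalive of 30 s with a hold time of 3x:
print(detection_time_ms(30000, 3), "ms")
```

The gap narrows considerably once BFD has to be generated by the same control-plane CPU that services BGP, which is the point made above.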

Regards,

Petr

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Nick 
Hilliard
Sent: Sunday, September 8, 2013 5:51 AM
To: Jon Mitchell
Cc: [email protected]
Subject: Re: [GROW] comments on draft-lapukhov-bgp-routing-large-dc-06

On 06/09/2013 03:45, Jon Mitchell wrote:
> We would like to solicit any further comments on 
> draft-lapukhov-bgp-routing-large-dc-06.  Originally this draft was 
> presented by Petr in Vancouver in both IDR and GROW and we feel this 
> draft is useful to the IETF community as it documents a working 
> scalable design for building large DC's.

I think this draft is really interesting.  A couple of things come to mind:

- default route origination is a real pain in larger-scale networks for obvious 
reasons, not least because you can often end up with a non-optimal egress choice 
(e.g. the RR's choice of default route).  You also need to take special care to 
ensure that the scope of the default route is limited to only the router sets 
that actually need it.  I prefer the idea of using a floating anycast default 
route: i.e. you inject an anycast prefix from all, or a selection of, the 
routers which can handle default-free traffic flow, then on small-FIB routers 
you install a static default pointing to this floating prefix and depend on 
recursive route lookup to figure out the best path (original idea from Saku 
Ytti - http://goo.gl/Nj69OZ).  This allows much tighter control of default 
route propagation, with better egress-choice characteristics.
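[A minimal sketch of the recursive-lookup trick described above, for concreteness; the addresses, prefixes, and next-hop names are all hypothetical:]

```python
import ipaddress

def lpm(rib, dst):
    """Longest-prefix match over a {prefix: next_hop} table."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix in rib:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None
                            or net.prefixlen > ipaddress.ip_network(best).prefixlen):
            best = prefix
    return rib.get(best)

# Hypothetical small-FIB ToR: the static default points at the floating
# anycast prefix, which is itself learned from the nearest default-free
# router - so the recursion tracks the best egress automatically.
rib = {
    "192.0.2.1/32": "uplink-to-nearest-dfz-router",  # anycast default prefix
    "0.0.0.0/0":    "192.0.2.1",                     # static default, recursive
}

hop1 = lpm(rib, "198.51.100.10")  # resolves to the anycast address
hop2 = lpm(rib, hop1)             # recursive step resolves the real uplink
print(hop1, "->", hop2)
```

If the nearest default-free router withdraws the anycast /32, ordinary best-path selection re-resolves the recursion toward the next-closest one, which is the "floating" behaviour being described.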

- in virtualised networks hosting third-party tenancies, it is often useful to 
extend L3 to the hypervisor.  With current tech, running thousands of VMs per 
cabinet is not unrealistic, and this number will undoubtedly increase over 
time.  This raises the issue of how to handle address assignment in an optimal 
manner, and how to assign one or more IP addresses per VM or VM cluster, 
without the problems associated with flat shared LANs, without using NAT and 
its consequent problems, but also without wasting precious public IP addresses 
on link networks, which typically have abysmal assignment efficiency (e.g. 50% 
waste for a /30, 75% for a /29, etc).  There are models out there which suggest 
using routing protocols, but BGP may not always be the best choice due to 
scalability issues: a ToR switch will be pretty limited in the number of BGP 
sessions it can handle.  Some people have approached this by using RIP to the 
client VMs.  The advantage of this is that RIP is fully stateless, whereas BGP 
has session overhead.  On a small ToR box with an itty-bitty RP, the overhead 
associated with high hundreds or even low thousands of BGP sessions may be too 
much.  I'm not sure if this is in the scope of this draft though.
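[For what it's worth, the percentages above read consistently as waste fractions for a two-address p2p link; a quick check under that reading:]

```python
def link_subnet_waste(prefixlen, hosts_needed=2):
    """Fraction of a link subnet's IPv4 addresses left unused when only
    `hosts_needed` addresses (the two link endpoints) are consumed."""
    total = 2 ** (32 - prefixlen)
    return 1 - hosts_needed / total

print(link_subnet_waste(30))  # a /30 holds 4 addresses, 2 used
print(link_subnet_waste(29))  # a /29 holds 8 addresses, 2 used
```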

- I think you skim over the problems associated with BGP session failure 
detection / reconvergence.  Mandating eBGP will get rid of the problems 
associated with the traditional loopback-to-loopback configuration of most iBGP 
networks, but there are still situations where dead-link detection is going to 
be necessary using some form of keepalive mechanism which works faster than 
eBGP timers.  It would be good to have a little more discussion about BFD - if 
we can point vendors to good use cases here, they will be more inclined to 
support BFD on ToR boxes.

- typo: s/it's/its/g

Nick

_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow