Eric, I'd just like to provide some feedback on your comments... no opinion on the draft. Please see inline.
On Feb 7, 2014, at 9:50 AM, Eric Rosen <[email protected]> wrote:

> I have a few questions and concerns about draft-xu-l3vpn-virtual-subnet-03.
>
> - Section 3.3:
>
>      PE routers SHOULD be able to discover their local CE hosts and keep
>      the list of these hosts up to date in a timely manner so as to ensure
>      the availability and accuracy of the corresponding host routes
>      originated from them.
>
>   Surely this is a MUST.  I don't see how the scheme can work without a
>   responsive and reliable discovery mechanism of some sort.

It is quite likely that the authors are assuming that there is a management system that creates the "guests" (CEs) and assigns them to a specific host (PE). That management system has to, for instance, spawn the guest instance, and is aware of the instance's network requirements. I don't believe that this document should specify the API used by the management system to inform the networking component on the host system.

>   Since the draft does not require any particular discovery scheme, perhaps
>   it should at least characterize the set of acceptable schemes.
>
> - Is a PE supposed to discover all the local hosts, and originate a host
>   route into BGP for each one of them?  Or are host routes originated only
>   for a subset of the local hosts?
>
>   I don't see anything in the draft that says how to choose a subset.
>   However, it seems like in the intended use case, the hosts are VMs, and
>   the draft says that a data center can contain millions of VMs.  Is each PE
>   going to originate host routes for millions of VMs?

In the context of the document, I believe that a PE is considered to be the host. But to your point, we have proof of existence that it is quite feasible for a BGP speaker to advertise millions of BGP L3VPN routes: RRs in current L3VPN deployments do this all the time. And L3VPN carrier networks have a higher number of reachable routes than most DCs have VMs using the tools available today.
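As a back-of-envelope illustration of why millions of host routes are not frightening for a server-based forwarder (every figure in the sketch is an illustrative assumption on my part, not a measurement):

```python
# Rough sketch: how many IPv4 host routes fit in a slice of server DRAM?
# Every figure here is an illustrative assumption, not a measurement.

server_ram_bytes = 192 * 2**30              # assume a 192 GiB server
fib_budget_bytes = server_ram_bytes // 100  # give 1% of RAM to forwarding state

# Assume ~128 bytes per software FIB entry (prefix, next hop, encap info,
# bookkeeping); real implementations vary, so this is order-of-magnitude only.
bytes_per_entry = 128

entries = fib_budget_bytes // bytes_per_entry
print(f"FIB budget: {fib_budget_bytes / 2**30:.2f} GiB")
print(f"Host routes that fit: {entries:,}")
```

Even with generous per-entry overhead, tens of millions of host routes fit in a tiny fraction of the memory of one commodity box.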
>   If so, I don't understand why the scheme is claimed to be scalable.  A
>   solution that relies on millions of BGP-distributed host routes might be
>   expected to exhibit some scaling problems having to do with
>   routing/forwarding table size.  (Note that section 3.9 proposes to
>   distribute host routes not only to other DCs, but to "cloud user sites",
>   as well.)

The routing/forwarding table sizes in traditional PEs have limitations that are driven by their design. For instance, a PE designed to process 1T of traffic will typically have to choose a hardware architecture that limits the forwarding tables to single-digit megabytes of very fast memory. A server that has to process 10/20G of traffic can probably afford to have its forwarding tables in DRAM. If you allocate 1% of the memory of a modern server to forwarding tables, that would be 1.9G.

Scaling is relative. L3VPN as it is deployable today in carrier networks would not be possible with a device like the Cisco AGS+. We have all heard forever that L3VPN doesn't scale... yet that doesn't seem to deter carriers from using it to provide services to their customers.

>   The draft talks about the increased path optimality that one gains from
>   using host routes.  Well, everyone knows that you get more optimal routing
>   with host routes, but the Internet doesn't run on host routes because of
>   the scaling issues.
>
> - I wondered originally whether the intention is that host routes are
>   distributed only in the exception cases, where a VM moves off its "native"
>   subnet.  But the draft doesn't seem to say anything like that.  It seems
>   rather to be eliminating the traditional notion of a localized subnet, and
>   then discussing how to "fool" the hosts into thinking that the localized
>   subnets still exist.  But this raises the question of whether the draft
>   discusses everything that might possibly break.  For instance, will DHCP
>   still work?
DHCP deployments typically depend on a DHCP relay rather than pure L2 reachability. Having a distributed DHCP relay is an option.

>   Probably the answer is going to be "anything that doesn't work any more
>   isn't needed in the DC environment".  Maybe the draft just needs to state
>   its applicability restrictions more clearly.
>
> - To provide good scaling, one needs to consider not only the number of VMs,
>   but the rate of movement.  How many VMs per second move from one DC to
>   another, how many VMs per second are created, how many destroyed?

The average rates are going to be measured in VM changes per minute or tens of minutes. In small clusters (which is where most tools are today) there just aren't that many VMs, nor do they change that frequently. Very, very large clusters using proprietary orchestration systems that have been optimized for dynamic workloads will schedule about a dozen VMs per minute when very busy. This last case is for people that build their entire application stack, have very dynamic applications, and are about 10+ years ahead of the general-purpose market.

Let me put it to you another way: for a VM to be spawned, there needs to be a very chatty XML/JSON exchange to instruct the host to go fetch a large image (gigabytes of data) and then start that VM. To move it is twice the cost. A route update is a very, very tiny effort compared to all of this, even if you do not write your orchestration code in interpreted languages with no concurrency. In general, the scale arguments that tend to come up are orders of magnitude away from reality.

>   These rates will have considerable impact on the control plane.  This issue
>   isn't even mentioned in the draft.
>
> - If a PE originates a host route, I don't see anything in the draft that
>   will cause the host route to time out and be withdrawn if the host
>   disappears.
>   (There is discussion of what to do if the host shows up
>   somewhere else, but I didn't see any discussion of what to do if the host
>   just disappears altogether.)  Surely a scheme based on host routes for
>   movable hosts needs some sort of 'garbage collection'.

The "garbage collection" is the responsibility of the orchestration system.

> - The draft suggests that if a PE, say PE1, has originated a host route for
>   host H, and then PE1 sees a host route for H from another PE, say PE2,
>   that PE1 should try to figure out whether H is still local, and withdraw
>   the route if it concludes that H is no longer local.

That would be unnecessary.

>   I believe this presumes that all VRFs have unique RDs; that should be
>   stated.  (Otherwise a route reflector might not forward all the routes.)
>
>   Suppose PE1 sees a host route for H from PE2, but PE1 then concludes that
>   H is still local.  Is the local route to be considered preferable?  Does
>   it install the BGP route from PE2, but not issue the proxy ARP responses?
>   The draft should state the procedures for this case.
>
>   What if there is a local BGP route for PE2, (say, from a CE router), but
>   the BGP decision process chooses the remote route?
>
> - It seems to me that the scheme does not work at all if a single site is
>   attached to two PEs, UNLESS those PEs negotiate some sort of
>   primary/secondary relationship.
>
>   The draft does mention this:
>
>      "In the scenario where a given VPN site (i.e., a data
>      center) is multi-homed to more than one PE router via an
>      Ethernet switch or an Ethernet network, Virtual Router
>      Redundancy Protocol (VRRP) [RFC5798] is usually enabled on
>      these PE routers.  In this case, only the PE router being
>      elected as the VRRP Master is allowed to perform the
>      ARP/ND proxy function."
>
>   But I'm not sure what to make of the "usually".
>   The draft does not
>   say that its applicability is restricted to the cases where either (a) a
>   site attaches only to a single PE, or (b) the site attaches to two PEs
>   that are running VRRP with each other.  So we need to examine what will
>   happen if the site attaches to two PEs that are not running VRRP.
>
>   Suppose Site-1 has Host H-1, and attaches to PE-11 and PE-12.  Site-2 has
>   host H-2, and attaches to PE-2.  Suppose further that H-1 and H-2 have
>   addresses "in the same subnet".  PE-2 discovers the presence of H-2, and
>   so distributes a host route for it; PE-11 and PE-12 import this route.
>
>   Now H-1 sends an ARP request for H-2.  PE-11 and PE-12 both generate a
>   proxy response.  That by itself is probably enough to mess up the
>   communication from Site-1 to H-2.  But PE-11 and PE-12 will see each
>   other's proxy responses, and hence will both conclude that H-2 is local.
>   So they will both generate host routes for H-2 and distribute them to the
>   other PEs.  Now all the other PEs will think that H-2 is reachable via
>   PE-11, PE-12, and PE-2.  This will certainly screw up any attempts to
>   reach H-2 from other sites.
>
>   I think that the draft either needs to state that it is not applicable
>   when two PEs attach to a site (unless they use VRRP), or else some
>   protocol for choosing the "master PE" at a site needs to be developed.
>
> - I don't completely follow some of the procedures for inter-subnet routing.
>   From section 3.1.2:
>
>      "Assume host A sends an ARP request for its default gateway
>      (i.e., 1.1.1.4) prior to communicating with a destination
>      host outside of its subnet.  Upon receiving this ARP
>      request, PE-1 acting as an ARP proxy returns its own MAC
>      address as a response.  Host A then sends a packet for Host
>      B to PE-1.  PE-1 tunnels such packet towards PE-2 according
>      to the default route learnt from PE-2, which in turn
>      forwards that packet to GW."
>   It seems to me that PE-1 will forward the packet according to the routes
>   in its VRF (i.e., PE-1 actually functions as the default gateway), and the
>   packet may or may not actually go to PE-2 and then to GW.  If Host B is
>   out on the Internet, and there are Internet gateways at several sites, the
>   one that actually gets used will not necessarily be the one that Host A is
>   configured to use.
>
>   I'm not sure this is a problem; it could be considered to be a feature.
>   But it is certainly something that the draft should discuss.
>
> - If host discovery is going to be done by snooping ARP traffic, and if host
>   discovery is going to cause BGP activity, then we have some scaling and
>   security issues that need to be discussed.
>
>   By generating a "bogus" ARP response for host H, one can force a PE to
>   originate a host route, and this in turn will cause some amount of traffic
>   to H to be delivered to the wrong site.  That is, the effect of a bogus
>   ARP Response is not limited to a particular site.  This certainly needs to
>   be mentioned in the Security Considerations section.
>
>   Further, by generating an arbitrary number of bogus ARP responses, one can
>   cause a PE to originate an arbitrary number of host routes, thus causing
>   an excessive amount of BGP activity.  This is an attack vector which also
>   needs to be discussed in the Security Considerations.
>
>   So I don't think it's true that the draft introduces "no new security
>   considerations".
>
> - The section on multicast mentions tunnels, but I think an important issue
>   in multicast is going to be how the PIM Designated Routers at a given site
>   do the RPF determination, and this isn't even mentioned.
>
> - What is "VPN Instance Space Scalability"?  (I don't know the term "VPN
>   Instance Space".)

I'm not sure what the intent of the draft is. I read it as "proxy ARP is a useful component in a DC solution". That seems to be a reasonable statement.
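To illustrate the multi-homing concern concretely, here is a toy model of the failure chain you describe for two non-VRRP PEs on one site. This is pure illustration: the PE names come from your scenario, and the snooping heuristic is my reading of the draft, not specified behavior.

```python
# Toy model (illustrative only): Site-1 attaches to PE-11 and PE-12 (no
# VRRP), Site-2's host H-2 is behind PE-2, and PEs classify hosts as
# "local" by snooping ARP replies on the shared segment.

def simulate():
    # PE-2 legitimately originates the host route for H-2.
    originators = {"H-2": {"PE-2"}}
    site1_pes = ("PE-11", "PE-12")

    # H-1 ARPs for H-2; both Site-1 PEs imported the route, so both
    # answer with a proxy ARP reply.
    proxy_repliers = list(site1_pes)

    # Each Site-1 PE sees the *other* PE's proxy reply on the shared LAN,
    # concludes H-2 is local, and originates its own host route for it.
    for pe in site1_pes:
        if any(p != pe for p in proxy_repliers):
            originators["H-2"].add(pe)

    return sorted(originators["H-2"])

print(simulate())  # ['PE-11', 'PE-12', 'PE-2'] -- three originators for one host
```

Remote PEs end up with three claimed attachment points for a single host, which is exactly the breakage described above.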
As you point out, however, when the document attempts to specify behavior, that behavior tends to be incomplete and in some cases incorrect. Perhaps the document should specify less and just default to the concepts of proxy ARP and standard L3VPN forwarding rules, both of which are well understood.

Pedro.
