I have a few questions and concerns about draft-xu-l3vpn-virtual-subnet-03.
- Section 3.3:
"PE routers SHOULD be able to discover their local CE hosts and keep
the list of these hosts up to date in a timely manner so as to ensure
the availability and accuracy of the corresponding host routes
originated from them."
Surely this is a MUST. I don't see how the scheme can work without a
responsive and reliable discovery mechanism of some sort.
Since the draft does not require any particular discovery scheme, perhaps
it should at least characterize the set of acceptable schemes.
- Is a PE supposed to discover all the local hosts, and originate a host
route into BGP for each one of them? Or are host routes originated only
for a subset of the local hosts?
I don't see anything in the draft that says how to choose a subset.
However, it seems like in the intended use case, the hosts are VMs, and
the draft says that a data center can contain millions of VMs. Is each PE
going to originate host routes for millions of VMs?
If so, I don't understand why the scheme is claimed to be scalable. A
solution that relies on millions of BGP-distributed host routes might be
expected to exhibit some scaling problems having to do with
routing/forwarding table size. (Note that section 3.9 proposes to
distribute host routes not only to other DCs, but to "cloud user sites",
as well.)
The draft talks about the increased path optimality that one gains from
using host routes. Well, everyone knows that you get more optimal routing
with host routes, but the Internet doesn't run on host routes because of
the scaling issues.
- I wondered originally whether the intention is that host routes are
distributed only in the exception cases, where a VM moves off its "native"
subnet. But the draft doesn't seem to say anything like that. It seems
rather to be eliminating the traditional notion of a localized subnet, and
then discussing how to "fool" the hosts into thinking that the localized
subnets still exist. But this raises the question of whether the draft
discusses everything that might possibly break. For instance, will DHCP
still work?
Probably the answer is going to be "anything that doesn't work any more
isn't needed in the DC environment". Maybe the draft just needs to state
its applicability restrictions more clearly.
- To evaluate scaling, one needs to consider not only the number of VMs,
but also the rate of movement. How many VMs per second move from one DC to
another, how many VMs per second are created, how many destroyed? These
rates will have considerable impact on the control plane. This issue
isn't even mentioned in the draft.
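To make the point concrete, here is a back-of-the-envelope sketch of how those rates turn into control plane load. The cost model is my own assumption, not something the draft states: a move costs one withdrawal plus one advertisement, a create costs one advertisement, and a destroy costs one withdrawal. The function names and the example inputs are hypothetical.

```python
def bgp_events_per_second(moves, creates, destroys):
    """Rough count of host-route advertisements/withdrawals per second.

    Assumption (mine, not the draft's): a move costs one withdrawal at the
    old PE plus one advertisement at the new PE; a create costs one
    advertisement; a destroy costs one withdrawal.
    """
    return 2 * moves + creates + destroys


def route_changes_per_second(events, importing_pes):
    """Each event has to be processed by every PE that imports the route."""
    return events * importing_pes


# Hypothetical inputs, chosen only to show the arithmetic.
events = bgp_events_per_second(moves=10, creates=5, destroys=5)
total = route_changes_per_second(events, importing_pes=100)
```

Whatever the real numbers are, the draft would need to say which rates it expects to support before one could judge the scaling claim.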
- If a PE originates a host route, I don't see anything in the draft that
will cause the host route to time out and be withdrawn if the host
disappears. (There is discussion of what to do if the host shows up
somewhere else, but I didn't see any discussion of what to do if the host
just disappears altogether.) Surely a scheme based on host routes for
movable hosts needs some sort of 'garbage collection'.
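As an illustration of the kind of 'garbage collection' I have in mind, here is a minimal sketch of a host table with a hold timer. Nothing in it comes from the draft: the class name, the `hold_time` parameter, and the idea of refreshing an entry whenever the host is heard from are all my assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class HostRouteTable:
    """Hypothetical per-VRF table of locally discovered hosts (not from the draft)."""
    hold_time: float  # seconds a host may stay silent before its route is withdrawn
    last_seen: dict = field(default_factory=dict)  # host IP -> time last heard from
    withdrawn: list = field(default_factory=list)  # log of withdrawn host routes

    def refresh(self, host_ip, now):
        """Record evidence that host_ip is alive (e.g. an ARP reply).

        Returns True if the host is new, i.e. a host route would be originated.
        """
        is_new = host_ip not in self.last_seen
        self.last_seen[host_ip] = now
        return is_new

    def collect_garbage(self, now):
        """Withdraw routes for hosts that have been silent longer than hold_time."""
        stale = [ip for ip, seen in self.last_seen.items()
                 if now - seen > self.hold_time]
        for ip in stale:
            del self.last_seen[ip]
            self.withdrawn.append(ip)
        return stale


table = HostRouteTable(hold_time=30.0)
table.refresh("1.1.1.1", now=0.0)
table.refresh("1.1.1.2", now=0.0)
table.refresh("1.1.1.1", now=25.0)          # 1.1.1.1 is heard from again
expired = table.collect_garbage(now=40.0)   # only 1.1.1.2 has been silent > 30s
```

The hard part, which the sketch does not address, is what counts as a refresh and how the hold time should be chosen; that is what the draft would need to specify.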
- The draft suggests that if a PE, say PE1, has originated a host route for
host H, and then PE1 sees a host route for H from another PE, say PE2,
that PE1 should try to figure out whether H is still local, and withdraw
the route if it concludes that H is no longer local.
I believe this presumes that all VRFs have unique RDs; that should be
stated. (Otherwise the routes for H from PE1 and PE2 have the same NLRI,
and a route reflector might forward only one of them, so PE1 might never
see the route from PE2.)
Suppose PE1 sees a host route for H from PE2, but PE1 then concludes that
H is still local. Is the local route to be considered preferable? Does
it install the BGP route from PE2, but not issue the proxy ARP responses?
The draft should state the procedures for this case.
What if there is a local BGP route for H (say, from a CE router), but
the BGP decision process chooses the remote route?
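The RD point can be shown with a toy model. A VPN-IPv4 route is keyed by (RD, prefix), and a route reflector that is not using something like add-paths advertises one best path per key. The sketch below is only an illustration: the RD values, the host address, and the tie-break rule are hypothetical stand-ins, not the real BGP decision process.

```python
def reflect(routes):
    """Toy route reflector: keeps one path per NLRI, where NLRI = (RD, prefix).

    routes: list of (rd, prefix, next_hop) tuples. The tie-break (lowest
    next hop name) is a stand-in for the real BGP decision process.
    """
    best = {}
    for rd, prefix, next_hop in routes:
        key = (rd, prefix)
        if key not in best or next_hop < best[key]:
            best[key] = next_hop
    return sorted((rd, prefix, nh) for (rd, prefix), nh in best.items())


# Same RD on both PEs: the two routes for H collapse into one NLRI.
shared_rd = reflect([
    ("65000:1", "1.1.1.5/32", "PE1"),
    ("65000:1", "1.1.1.5/32", "PE2"),
])

# Unique RD per VRF: both routes for H survive reflection.
unique_rd = reflect([
    ("65000:1", "1.1.1.5/32", "PE1"),
    ("65000:2", "1.1.1.5/32", "PE2"),
])
```

In the shared-RD case only PE1's own route comes back from the reflector, so PE1 gets no signal that H may have moved to PE2.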
- It seems to me that the scheme does not work at all if a single site is
attached to two PEs, UNLESS those PEs negotiate some sort of
primary/secondary relationship.
The draft does mention this:
"In the scenario where a given VPN site (i.e., a data
center) is multi-homed to more than one PE router via an
Ethernet switch or an Ethernet network, Virtual Router
Redundancy Protocol (VRRP) [RFC5798] is usually enabled on
these PE routers. In this case, only the PE router being
elected as the VRRP Master is allowed to perform the
ARP/ND proxy function."
But I'm not sure what to make of the "usually". The draft does not
say that its applicability is restricted to the cases where either (a) a
site attaches only to a single PE, or (b) the site attaches to two PEs
that are running VRRP with each other. So we need to examine what will
happen if the site attaches to two PEs that are not running VRRP.
Suppose Site-1 has Host H-1, and attaches to PE-11 and PE-12. Site-2 has
host H-2, and attaches to PE-2. Suppose further that H-1 and H-2 have
addresses "in the same subnet". PE-2 discovers the presence of H-2, and
so distributes a host route for it; PE-11 and PE-12 import this route.
Now H-1 sends an ARP request for H-2. PE-11 and PE-12 both generate a
proxy response. That by itself is probably enough to mess up the
communication from Site-1 to H-2. But PE-11 and PE-12 will see each
other's proxy responses, and hence will both conclude that H-2 is local.
So they will both generate host routes for H-2 and distribute them to the
other PEs. Now all the other PEs will think that H-2 is reachable via
PE-11, PE-12, and PE-2. This will certainly screw up any attempts to
reach H-2 from other sites.
I think that the draft either needs to state that it is not applicable
when two PEs attach to a site (unless they use VRRP), or else some
protocol for choosing the "master PE" at a site needs to be developed.
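The failure scenario above can be written down as a small simulation. It assumes, as I did in the walkthrough, that host discovery is done by snooping ARP replies on the CE-facing segment and that a PE cannot tell another PE's proxy reply from a reply sent by the real host. The function name and the modelling are mine; the PE and host names are the ones used above.

```python
def simulate(site1_pes):
    """Return the set of PEs that end up originating a host route for H-2.

    site1_pes: names of the PEs attached to Site-1, all on the same
    Ethernet segment, none of them running VRRP.
    """
    originators = {"PE-2"}  # PE-2 really did discover H-2 at Site-2

    # Every Site-1 PE has imported PE-2's host route, so when H-1 sends an
    # ARP request for H-2, every one of them sends a proxy reply.
    for replier in site1_pes:
        for listener in site1_pes:
            if listener != replier:
                # The listener snoops an ARP reply for H-2 on its CE-facing
                # interface and (naively) concludes that H-2 is local.
                originators.add(listener)
    return originators


single_homed = simulate(["PE-11"])
dual_homed = simulate(["PE-11", "PE-12"])
```

With one PE at Site-1 only PE-2 originates the route; with two PEs and no election, all three do.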
- I don't completely follow some of the procedures for inter-subnet routing.
From section 3.1.2:
"Assume host A sends an ARP request for its default gateway
(i.e., 1.1.1.4) prior to communicating with a destination
host outside of its subnet. Upon receiving this ARP
request, PE-1 acting as an ARP proxy returns its own MAC
address as a response. Host A then sends a packet for Host
B to PE-1. PE-1 tunnels such packet towards PE-2 according
to the default route learnt from PE-2, which in turn
forwards that packet to GW."
It seems to me that PE-1 will forward the packet according to the routes
in its VRF (i.e., PE-1 actually functions as the default gateway), and the
packet may or may not actually go to PE-2 and then to GW. If Host B is
out on the Internet, and there are Internet gateways at several sites, the
one that actually gets used will not necessarily be the one that Host A is
configured to use.
I'm not sure this is a problem; it could be considered to be a feature.
But it is certainly something that the draft should discuss.
- If host discovery is going to be done by snooping ARP traffic, and if host
discovery is going to cause BGP activity, then we have some scaling and
security issues that need to be discussed.
By generating a "bogus" ARP response for host H, one can force a PE to
originate a host route, and this in turn will cause some amount of traffic
to H to be delivered to the wrong site. That is, the effect of a bogus
ARP Response is not limited to a particular site. This certainly needs to
be mentioned in the Security Considerations section.
Further, by generating an arbitrary number of bogus ARP responses, one can
cause a PE to originate an arbitrary number of host routes, thus causing
an excessive amount of BGP activity. This is an attack vector which also
needs to be discussed in the Security Considerations.
So I don't think it's true that the draft introduces "no new security
considerations".
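A rough sketch of the second attack, assuming a PE that originates a /32 for every distinct sender address it sees in an ARP reply. The `max_routes` cap is a hypothetical knob of my own; the draft defines nothing like it, which is rather the point.

```python
def snoop_arp_replies(replies, max_routes=None):
    """Return the host routes a naive ARP-snooping PE would originate.

    replies: iterable of sender IP addresses taken from ARP replies seen on
    the CE-facing interface. max_routes is a hypothetical per-VRF cap; with
    None, every distinct (possibly forged) sender becomes a /32 route.
    """
    routes = []
    seen = set()
    for sender_ip in replies:
        if sender_ip in seen:
            continue
        if max_routes is not None and len(routes) >= max_routes:
            break
        seen.add(sender_ip)
        routes.append(sender_ip + "/32")
    return routes


# 1000 forged replies, each claiming a different sender address.
forged = ["10.0.%d.%d" % (i // 256, i % 256) for i in range(1000)]
uncapped = snoop_arp_replies(forged)               # one host route per forgery
capped = snoop_arp_replies(forged, max_routes=100)
```

Without some bound of this kind, the number of host routes a PE injects into BGP is controlled by whoever can send frames on the attached segment.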
- The section on multicast mentions tunnels, but I think an important issue
in multicast is going to be how the PIM Designated Routers at a given site
do the RPF determination, and this isn't even mentioned.
- What is "VPN Instance Space Scalability"? (I don't know the term "VPN
Instance Space".)