Hi Eric,

Thanks a lot for your detailed comments. 

Since Pedro has already given excellent responses to most of your comments in
the email below, I will respond to the remaining unresolved comments inline in
that email.

> -----Original Message-----
> From: L3VPN [mailto:[email protected]] On Behalf Of Pedro Roque Marques
> Sent: February 8, 2014 2:43
> To: Eric Rosen (erosen)
> Cc: L3VPN WG
> Subject: Re: comments on draft-xu-l3vpn-virtual-subnet-03
> 
> Eric,
> I'd just like to provide some feedback to your comments... no opinion on the
> draft.
> Please see inline.
> 
> On Feb 7, 2014, at 9:50 AM, Eric Rosen <[email protected]> wrote:
> 
> > I have a few questions and concerns about draft-xu-l3vpn-virtual-subnet-03.
> >
> > - Section 3.3:
> >
> >      PE routers SHOULD be able to discover their local CE hosts and keep
> >      the list of these hosts up to date in a timely manner so as to ensure
> >      the availability and accuracy of the corresponding host routes
> >      originated from them.
> >
> >   Surely this is a MUST.  I don't see how the scheme can work without a
> >   responsive and reliable discovery mechanism of some sort.
> 
> It is quite likely that the authors are assuming that there is a management
> system that creates the "guest" (CE) and assigns it to a specific host (PE).
> That management system has to, for instance, spawn the guest instance and is
> aware of the instance's network requirements.
> I don't believe that this document should specify the API used by the
> management system to inform the networking component on the host system.
> 
> >
> >   Since the draft does not require any particular discovery scheme, perhaps
> >   it should at least characterize the set of acceptable schemes.
> >
> > - Is a PE supposed to discover all the local hosts, and originate a
> > host  route into BGP for each one of them?  Or are host routes
> > originated only  for a subset of the local hosts?
> >
> >  I don't see anything in the draft that says how to choose a subset.
> >  However, it seems like in the intended use case, the hosts are VMs,
> > and  the draft says that a data center can contain millions of VMs.
> > Is each PE  going to originate host routes for millions of VMs?
> 
> In the context of the document, I believe that a PE is considered to be the
> host.
> But to your point, we have an existence proof that it is quite feasible for a
> BGP speaker to advertise millions of BGP L3VPN routes. RRs in current L3VPN
> deployments do this all the time. And L3VPN carrier networks have a higher
> number of reachable routes than most DCs have VMs using the tools available
> today.
> 
> >
> >  If so, I don't understand why the scheme is claimed to be scalable.
> > A  solution that relies on millions of BGP-distributed host routes
> > might be  expected to exhibit some scaling problems having to do with
> > routing/forwarding table size.  (Note that section 3.9 proposes to
> > distribute host routes not only to other DCs, but to "cloud user
> > sites",  as well.)
> 
> The routing/forwarding table sizes in traditional PEs have limitations that
> are driven by their design. For instance, a PE designed to process 1T of
> traffic will typically have to choose a hardware architecture that limits the
> forwarding tables to single-digit megabytes of very fast memory. A server that
> has to process 10/20G of traffic can probably afford to have its forwarding
> tables in DRAM. If you allocate 1% of the memory of a modern server to
> forwarding tables, that would be 1.9G.
> 
> Scaling is relative. L3VPN as it is deployed today in carrier networks would
> not be possible with a device like the Cisco AGS+.
> We have all heard forever that L3VPN doesn't scale... yet that doesn't seem to
> deter carriers from using it to provide services to their customers.
> 
> >
> >  The draft talks about the increased path optimality that one gains
> > from  using host routes.  Well, everyone knows that you get more
> > optimal routing  with host routes, but the Internet doesn't run on
> > host routes because of  the scaling issues.
> >
> >
> > - I wondered originally whether the intention is that host routes are
> > distributed only in the exception cases, where a VM moves off its "native"
> >  subnet.  But the draft doesn't seem to say anything like that.  It
> > seems  rather to be eliminating the traditional notion of a localized
> > subnet, and  then discussing how to "fool" the hosts into thinking
> > that the localized  subnets still exist.  But this raises the question
> > of whether the draft  discusses everything that might possibly break.
> > For instance, will DHCP  still work?
> 
> DHCP deployments typically depend on a DHCP relay rather than pure L2
> reachability.
> Having a distributed DHCP relay is an option.
> 
> >
> >  Probably the answer is going to be "anything that doesn't work any
> > more  isn't needed in the DC environment".  Maybe the draft just needs
> > to state  its applicability restrictions more clearly.
> >
> > - To provide good scaling, one needs to consider not only the number
> > of VMs,  but the rate of movement.  How many VMs per second move from
> > one DC to  another, how many VMs per second are created, how many
> > destroyed?
> 
> The average rates are going to be measured in VM changes per minute or tens of
> minutes.
> In small clusters (which is where most tools are today) there just aren't that
> many VMs, nor do they change that frequently.
> Very, very large clusters using proprietary orchestration systems that have
> been optimized for dynamic workloads will schedule about a dozen VMs per
> minute when very busy. This last case is for people that do their entire
> application stack, have very dynamic applications, and are about 10+ years
> ahead of the general-purpose market.
> 
> Let me put it to you another way: for a VM to be spawned, there needs to be a
> very chatty XML/JSON exchange to instruct the host to go fetch a large image
> (gigabytes) of data and then start that VM. To move it is twice the cost. A
> route update is a very, very tiny effort compared to all of this, even if you
> do not write your orchestration code in interpreted languages with no
> concurrency.
> 
> In general the scale arguments that tend to come up are orders of magnitude
> away from reality.
> 
> >  These
> >  rates will have considerable impact on the control plane.  This issue
> > isn't even mentioned in the draft.
> >
> > - If a PE originates a host route, I don't see anything in the draft
> > that  will cause the host route to time out and be withdrawn if the
> > host  disappears.  (There is discussion of what to do if the host
> > shows up  somewhere else, but I didn't see any discussion of what to
> > do if the host  just disappears altogether.)  Surely a scheme based on
> > host routes for  movable hosts needs some sort of 'garbage collection'.
> 
> The "garbage collection" is the responsibility of the orchestration system.
> 
> >
> > - The draft suggests that if a PE, say PE1, has originated a host
> > route for  host H, and then PE1 sees a host route for H from another
> > PE, say PE2,  that PE1 should try to figure out whether H is still
> > local, and withdraw  the route if it concludes that H is no longer local.
> 
> That would be unnecessary.

Yes, the above procedure is unnecessary when the orchestration system or VDP
is used for local CE host discovery. As stated in the draft, this trick is
only useful "In the case where there is no explicit VM detachment notification
mechanism".

> 
> >
> >  I believe this presumes that all VRFs have unique RDs; that should be
> > stated.  (Otherwise a route reflector might not forward all the
> > routes.)
> >
> >  Suppose PE1 sees a host route for H from PE2, but PE1 then concludes
> > that  H is still local.  Is the local route to be considered
> > preferable?  Does  it install the BGP route from PE2, but not issue the
> > proxy ARP responses?
> >  The draft should state the procedures for this case.
> >
> >  What if there is a local BGP route for PE2, (say, from a CE router),
> > but  the BGP decision process chooses the remote route?
> >
> > - It seems to me that the scheme does not work at all if a single site
> > is  attached to two PEs, UNLESS those PEs negotiate some sort of
> > primary/secondary relationship.
> >
> >  The draft does mention this:
> >
> >       "In the scenario where a given VPN site (i.e., a data
> >       center) is multi-homed to more than one PE router via an
> >       Ethernet switch or an Ethernet network, Virtual Router
> >       Redundancy Protocol (VRRP) [RFC5798] is usually enabled on
> >       these PE routers. In this case, only the PE router being
> >       elected as the VRRP Master is allowed to perform the
> >       ARP/ND proxy function."
> >
> >  But I'm not sure what to make of the "usually".  The draft does not
> > say that its applicability is restricted to the cases where either (a)
> > a  site attaches only to a single PE, or (b) the site attaches to two
> > PEs  that are running VRRP with each other.  So we need to examine
> > what will  happen if the site attaches to two PEs that are not running VRRP.

How about replacing "is usually" with "SHOULD be" or "MUST be"?
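
To make the quoted restriction concrete, here is a minimal Python sketch of a PE
that answers proxy ARP only while it is the VRRP Master. It is an illustration
under assumptions, not something specified in the draft: the names `VrrpState`
and `should_proxy_arp`, and the idea of passing in the set of BGP-learned host
routes, are hypothetical.

```python
from enum import Enum


class VrrpState(Enum):
    MASTER = "master"
    BACKUP = "backup"


def should_proxy_arp(vrrp_state, target_ip, remote_host_routes):
    """Decide whether this PE answers an ARP request on behalf of target_ip.

    remote_host_routes: set of host IPs learned via BGP from other PEs.
    Hypothetical helper; the draft text quoted above only says that the
    VRRP Master alone performs the ARP/ND proxy function.
    """
    if vrrp_state is not VrrpState.MASTER:
        return False  # backup PEs stay silent, so only one reply is sent
    # The Master proxies only for hosts it knows to be behind a remote PE.
    return target_ip in remote_host_routes
```

With this gating, the PE-11/PE-12 scenario would produce a single proxy
response for H-2 instead of two.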

> >  Suppose Site-1 has Host H-1, and attaches to PE-11 and PE-12.  Site-2
> > has  host H-2, and attaches to PE-2.  Suppose further that H-1 and H-2
> > have  addresses "in the same subnet".  PE-2 discovers the presence of
> > H-2, and  so distributes a host route for it; PE-11 and PE-12 import this 
> > route.
> >
> >  Now H-1 sends an ARP request for H-2.  PE-11 and PE-12 both generate
> > a  proxy response.  That by itself is probably enough to mess up the
> > communication from Site-1 to H-2.  But PE-11 and PE-12 will see each
> > other's proxy responses, and hence will both conclude that H-2 is local.
> >  So they will both generate host routes for H-2 and distribute them to
> > the  other PEs.  Now all the other PEs will think that H-2 is
> > reachable via  PE-11, PE-12, and PE-2.  This will certainly screw up
> > any attempts to  reach H-2 from other sites.
> >
> >  I think that the draft either needs to state that it is not
> > applicable  when two PEs attach to a site (unless they use VRRP), or
> > else some  protocol for choosing the "master PE" at a site needs to be
> > developed.

Good suggestion. The applicability will be stated explicitly in the next revision.

> > - I don't completely follow some of the procedures for inter-subnet routing.
> >  From section 3.1.2:
> >
> >      "Assume host A sends an ARP request for its default gateway
> >      (i.e., 1.1.1.4) prior to communicating with a destination
> >      host outside of its subnet. Upon receiving this ARP
> >      request, PE-1 acting as an ARP proxy returns its own MAC
> >      address as a response.  Host A then sends a packet for Host
> >      B to PE-1. PE-1 tunnels such packet towards PE-2 according
> >      to the default route learnt from PE-2, which in turn
> >      forwards that packet to GW."
> >
> >  It seems to me that PE-1 will forward the packet according to the
> > routes  in its VRF (i.e., PE-1 actually functions as the default
> > gateway), and the  packet may or may not actually go to PE-2 and then
> > to GW.  If Host B is  out on the Internet, and there are Internet
> > gateways at several sites, the  one that actually gets used will not
> > necessarily be the one that Host A is  configured to use.

Your observation is correct. In fact, the behavior you describe is the case
illustrated in Figure 4. The description you quoted applies only to the case
illustrated in Figure 2.

> >  I'm not sure this is a problem; it could be considered to be a feature.
> >  But it is certainly something that the draft should discuss.
> >
> > - If host discovery is going to be done by snooping ARP traffic, and
> > if host  discovery is going to cause BGP activity, then we have some
> > scaling and  security issues that need to be discussed.
> >
> >  By generating a "bogus" ARP response for host H, one can force a PE
> > to  originate a host route, and this in turn will cause some amount of
> > traffic  to H to be delivered to the wrong site.  That is, the effect
> > of a bogus  ARP Response is not limited to a particular site.  This
> > certainly needs to  be mentioned in the Security Considerations section.
> >
> >  Further, by generating an arbitrary number of bogus ARP responses,
> > one can  cause a PE to originate an arbitrary number of host routes,
> > thus causing  an excessive amount of BGP activity.  This is an attack
> > vector which also  needs to be discussed in the Security Considerations.
> >
> >  So I don't think it's true that the draft introduces "no new security
> > considerations".

As Pedro and Robert have pointed out, the orchestration system would be used
for host discovery in most cases. Host discovery by snooping ARP traffic may be
applicable only in some corner cases. Hence, how about we mention the potential
security risks you describe above and then state that, when that approach is
used, some mechanism must be provisioned to mitigate the risk of ARP spoofing,
such as a DHCP-snooping-based ARP check or a limit on the maximum number of ARP
entries per VPN instance?
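
For illustration only, the per-VPN-instance limit on ARP entries mentioned
above could be sketched roughly as follows. This is a minimal Python sketch,
not something specified in the draft; the class name `ArpLearnLimiter`, the
`max_entries` parameter, and the VPN and host identifiers are all hypothetical.

```python
from collections import defaultdict


class ArpLearnLimiter:
    """Caps the number of ARP-learned host entries per VPN instance.

    Hypothetical illustration: names and behavior are assumptions,
    not taken from draft-xu-l3vpn-virtual-subnet.
    """

    def __init__(self, max_entries):
        self.max_entries = max_entries
        # VPN instance name -> set of host IPs learned via ARP snooping
        self.learned = defaultdict(set)

    def learn(self, vpn, host_ip):
        """Return True if a host route may be originated for host_ip."""
        hosts = self.learned[vpn]
        if host_ip in hosts:
            return True  # already known, no new route needed
        if len(hosts) >= self.max_entries:
            return False  # cap reached: ignore the ARP-learned host
        hosts.add(host_ip)
        return True

    def forget(self, vpn, host_ip):
        """Remove a host, e.g. after its ARP entry ages out."""
        self.learned[vpn].discard(host_ip)
```

A cap like this bounds how many host routes a flood of bogus ARP responses can
make a PE originate, but it does not by itself tell legitimate hosts from
spoofed ones, which is why a DHCP-snooping-based check is listed alongside it.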

> > - The section on multicast mentions tunnels, but I think an important
> > issue  in multicast is going to be how the PIM Designated Routers at a
> > given site  do the RPF determination, and this isn't even mentioned.

There is no difference from the normal MVPN mechanism. 

> > - What is "VPN Instance Space Scalability"?  (I don't know the term
> > "VPN  Instance Space".)

The term refers to the maximum number of VPN instances; it was chosen to avoid
confusion with the "VPN route space". Do you have a better suggestion?

Best regards,
Xiaohu

> I'm not sure what the intent of the draft is. I read it as "Proxy ARP is a
> useful component" in a DC solution.
> That seems to be a reasonable statement. As you point out, however, when the
> document attempts to specify behavior, it tends to be incomplete and in some
> cases incorrect. Perhaps the document should specify less and just default to
> the concept of proxy ARP and standard L3VPN forwarding rules, both of which
> are well understood.
> 
>   Pedro.
> 
