Hi Eric,

Thanks a lot for your detailed comments.



Since Robert has already given excellent responses to most of your comments in 
the email below, I will respond to the remaining unresolved comments inline in 
that email.



From: L3VPN [mailto:[email protected]] On Behalf Of Robert Raszuk
Sent: February 8, 2014 2:44
To: [email protected]
Cc: L3VPN WG
Subject: Re: comments on draft-xu-l3vpn-virtual-subnet-03

Hello Eric,

A few comments on some of your points ...

- Section 3.3:

      PE routers SHOULD be able to discover their local CE hosts and keep
      the list of these hosts up to date in a timely manner so as to ensure
      the availability and accuracy of the corresponding host routes
      originated from them.

   Surely this is a MUST.  I don't see how the scheme can work without a
   responsive and reliable discovery mechanism of some sort.

   Since the draft does not require any particular discovery scheme, perhaps
   it should at least characterize the set of acceptable schemes.


When we discussed this point among the co-authors, we noted that such a set of 
applicable schemes could be listed, but it is rather a local choice of the PE 
participating in this VPN model.

It can be anything from monitoring GARP messages to instrumentation of the 
orchestration layer. Yes, in traditional networks that equals the management 
layer and is considered a rather bad idea, but in DC environments it has been 
made to work reasonably well. Obviously, for some services like IaaS it is a 
must-have tool.


- Is a PE supposed to discover all the local hosts, and originate a host
  route into BGP for each one of them?  Or are host routes originated only
  for a subset of the local hosts?

  I don't see anything in the draft that says how to choose a subset.


Again, if you think from the perspective of the DC rather than the WAN, this 
can be for all or for a subset, depending on what orchestration tells the PE to 
do. After all, it is the orchestration layer which controls those VMs and 
provisions networking for them (for example, Neutron in OpenStack).


  However, it seems like in the intended use case, the hosts are VMs, and
  the draft says that a data center can contain millions of VMs.  Is each PE
  going to originate host routes for millions of VMs?


Nope. There is usually more than one PE serving those millions of VMs. Each PE, 
just like in vanilla L3VPN, only holds those VPN routes which are needed. It 
also only originates routes for its CEs (read: the VMs behind it).


  If so, I don't understand why the scheme is claimed to be scalable.  A
  solution that relies on millions of BGP-distributed host routes might be
  expected to exhibit some scaling problems having to do with
  routing/forwarding table size.  (Note that section 3.9 proposes to
  distribute host routes not only to other DCs, but to "cloud user sites",
  as well.)

This is getting interesting now. Am I now supposed to argue with you, Eric, 
that L3VPN is scalable? You know how it is scalable, and in this case, just as 
in other work in the same working group [hint: draft-ietf-l3vpn-end-systems], 
there are always host routes distributed within the DC.

Inter-DC or DC-to-user distribution, rather, always depends on careful choice 
and, when possible, aggregation at the gateway.
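
To illustrate the aggregation point, here is a minimal Python sketch. It is not 
taken from the draft; the addresses are made up, and the helper name 
aggregate_host_routes is hypothetical. It only shows how a gateway could 
collapse contiguous /32 host routes into covering prefixes before advertising 
them outside the DC, using the standard-library ipaddress module.

```python
import ipaddress

def aggregate_host_routes(host_routes):
    """Collapse host routes into the smallest set of exactly covering prefixes."""
    networks = [ipaddress.ip_network(r) for r in host_routes]
    return [str(n) for n in ipaddress.collapse_addresses(networks)]

# Hypothetical host routes learned inside the DC (assumption, not from the draft).
routes = ["10.1.1.0/32", "10.1.1.1/32", "10.1.1.2/32", "10.1.1.3/32", "10.1.2.9/32"]
aggregated = aggregate_host_routes(routes)
print(aggregated)
```

Note that collapse_addresses only merges prefixes that are fully covered by the 
input, so an isolated host route stays a /32. A real gateway might instead 
advertise a configured covering prefix; that is a policy choice.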

  with host routes, but the Internet doesn't run on host routes because of
  the scaling issues.

Well we are not talking about global flooding of host routes so this may be a 
subtle difference when comparing it to the Internet.

Also, these days the PEs are a bit different, especially in their data plane, 
compared to traditional routers ;)


- I wondered originally whether the intention is that host routes are
  distributed only in the exception cases, where a VM moves off its "native"
  subnet.  But the draft doesn't seem to say anything like that.

Perhaps it should. Good point.

[Xiaohu] I will consider this in the revision.

  discusses everything that might possibly break.  For instance, will DHCP
  still work?

Yes via DHCP relay or proxy features.


  Probably the answer is going to be "anything that doesn't work any more
  isn't needed in the DC environment".  Maybe the draft just needs to state
  its applicability restrictions more clearly.

Well, it is hard to describe how one is to build an entire DC in one IETF 
draft ... So only the points immediately relevant to the idea at hand are 
included.


- To provide good scaling, one needs to consider not only the number of VMs,
  but the rate of movement.  How many VMs per second move from one DC to
  another, how many VMs per second are created, how many destroyed?  These
  rates will have considerable impact on the control plane.  This issue
  isn't even mentioned in the draft.

I do not think that the BGP control plane is a problem here. As you know, it 
can converge pretty fast, especially in its latest multithreaded 
implementations.

The limiting factor here would be the VM data and state replication rate, which 
IMHO would be a few orders of magnitude lower than the BGP update/withdraw 
propagation rate.



- If a PE originates a host route, I don't see anything in the draft that
  will cause the host route to time out and be withdrawn if the host
  disappears.  (There is discussion of what to do if the host shows up
  somewhere else, but I didn't see any discussion of what to do if the host
  just disappears altogether.)  Surely a scheme based on host routes for
  movable hosts needs some sort of 'garbage collection'.


Again, the same as above. Orchestration will notice a VM going down. It will 
instruct the PE (most likely via an API) to remove such routes.
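
As a rough sketch of that interaction (a toy model only, not an API defined by 
the draft or by any orchestration system), the Python below tracks the host 
routes one PE originates and turns orchestration "VM up" / "VM down" events 
into advertise / withdraw actions. The class name PeHostRoutes, its methods, 
and the addresses are all hypothetical.

```python
class PeHostRoutes:
    """Toy model of the host routes one PE originates for its local VMs."""

    def __init__(self, pe_name):
        self.pe_name = pe_name
        self.originated = set()  # host prefixes currently advertised by this PE

    def vm_up(self, vm_ip):
        """Orchestration reports a VM attached behind this PE."""
        prefix = f"{vm_ip}/32"
        if prefix in self.originated:
            return None  # already advertised, nothing to send
        self.originated.add(prefix)
        return ("advertise", prefix)

    def vm_down(self, vm_ip):
        """Orchestration reports the VM is gone; withdraw its host route."""
        prefix = f"{vm_ip}/32"
        if prefix not in self.originated:
            return None  # never advertised, nothing to withdraw
        self.originated.discard(prefix)
        return ("withdraw", prefix)

# Hypothetical event sequence driven by the orchestration layer.
pe1 = PeHostRoutes("PE1")
events = [pe1.vm_up("192.0.2.10"), pe1.vm_up("192.0.2.11"), pe1.vm_down("192.0.2.10")]
print(events)
```

Because withdrawal is driven by an explicit event rather than a timer, no 
separate garbage collection is needed as long as orchestration reliably 
reports VM removal; that reliability is an assumption of this sketch.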


- The draft suggests that if a PE, say PE1, has originated a host route for
  host H, and then PE1 sees a host route for H from another PE, say PE2,
  that PE1 should try to figure out whether H is still local, and withdraw
  the route if it concludes that H is no longer local.

I think this section of the draft may need some work. I must have missed it.

[Xiaohu] The above is only applicable in some corner cases, as mentioned in a 
previous email.

Best regards,
Xiaohu

  I believe this presumes that all VRFs have unique RDs; that should be
  stated.  (Otherwise a route reflector might not forward all the routes.)

  Suppose PE1 sees a host route for H from PE2, but PE1 then concludes that
  H is still local.  Is the local route to be considered preferable?  Does
  it install the BGP route from PE2, but not issue the proxy ARP responses?
  The draft should state the procedures for this case.


I would not add any special handling or any special new procedures for this 
case. I would just treat this the way a regular L3VPN PE handles two paths for 
a given net.

Choose the overall best and install the overall best in the data plane. It can 
be local or it can be remote depending on best path criteria.
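
A minimal Python sketch of that idea follows. It is deliberately simplified: 
the real BGP decision process has more steps, and here only LOCAL_PREF and AS 
path length are compared. The Path type, the best_path helper, and the 
attribute values are hypothetical and chosen only to show that the winner may 
be either the local or the remote path.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Path:
    next_hop: str      # "local" or the name of the remote PE
    local_pref: int    # higher is preferred
    as_path_len: int   # shorter is preferred

def best_path(paths):
    """Simplified best-path choice: highest LOCAL_PREF, then shortest AS path."""
    return min(paths, key=lambda p: (-p.local_pref, p.as_path_len))

# Two candidate paths for the same host route (values are assumptions).
local = Path("local", local_pref=100, as_path_len=0)
remote = Path("PE2", local_pref=200, as_path_len=0)
chosen = best_path([local, remote])
print(chosen.next_hop)
```

Whichever path wins is the one installed in the data plane; no case-specific 
procedure is involved in this sketch.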


  What if there is a local BGP route for PE2, (say, from a CE router), but
  the BGP decision process chooses the remote route?


Then all other PEs will also choose the "remote route", hence no issue, is 
there?


- It seems to me that the scheme does not work at all if a single site is
  attached to two PEs, UNLESS those PEs negotiate some sort of
  primary/secondary relationship.


Two answers to that - excellent question:

* In the DC, the most common case is that VMs are served over one PE. Yes, I 
know this is, again, not what we are used to in the CE-PE WAN case. But the 
focus in the DC is on making the compute node redundant, not the single VMs 
which are running on it.

* However, if you did have such a case of a VM attached to two PEs (let's say 
over a common LAN subnet), it would be the regular anycast VPN case, where said 
PEs would advertise the host route independently. For return traffic and the 
L3VPN side this should work just fine.

For the host/VM side it is just like the draft says: one could use VRRP to 
select the master outbound PE if they are connected over a LAN, or any other 
alternative (again orchestration :), or just periodic probing.


  Suppose Site-1 has Host H-1, and attaches to PE-11 and PE-12.  Site-2 has
  host H-2, and attaches to PE-2.  Suppose further that H-1 and H-2 have
  addresses "in the same subnet".  PE-2 discovers the presence of H-2, and
  so distributes a host route for it; PE-11 and PE-12 import this route.

  Now H-1 sends an ARP request for H-2.  PE-11 and PE-12 both generate a
  proxy response.  That by itself is probably enough to mess up the
  communication from Site-1 to H-2.

Actually not .. this usually works :)

 But PE-11 and PE-12 will see each
  other's proxy responses, and hence will both conclude that H-2 is local.
  So they will both generate host routes for H-2

Well, I would not jump to that conclusion. The mere presence of a learned entry 
in the ARP table should not trigger host route auto-generation.

In any case, even watching GARP may not be widely deployed. I think it will 
mainly be the orchestration layer.

Best regards,
R.
