Hello Eric,

A few comments on some of your points ...

> - Section 3.3:
>
>       PE routers SHOULD be able to discover their local CE hosts and keep
>       the list of these hosts up to date in a timely manner so as to ensure
>       the availability and accuracy of the corresponding host routes
>       originated from them.
>
>    Surely this is a MUST.  I don't see how the scheme can work without a
>    responsive and reliable discovery mechanism of some sort.
>
>    Since the draft does not require any particular discovery scheme,
>    perhaps it should at least characterize the set of acceptable schemes.
>


We discussed this point among the co-authors. Such a set of applicable
schemes could be listed, but it is rather a local choice of the PE
participating in this VPN model.

It can be anything from monitoring GARP messages to instrumentation by the
orchestration layer. Yes, in traditional networks that equals the
management layer and is considered a rather bad idea, but in DC
environments it has been made to work reasonably well. Obviously, for some
services like IaaS it is a must-have tool.


> - Is a PE supposed to discover all the local hosts, and originate a host
>   route into BGP for each one of them?  Or are host routes originated only
>   for a subset of the local hosts?
>
>   I don't see anything in the draft that says how to choose a subset.
>


Again, if you think from the perspective of the DC rather than the WAN,
this can be for all hosts or for a subset, depending on what orchestration
tells the PE to do. After all, it is the orchestration layer that controls
those VMs and provisions networking for them (for example, Neutron in
OpenStack).


>   However, it seems like in the intended use case, the hosts are VMs, and
>   the draft says that a data center can contain millions of VMs.  Is each
>   PE going to originate host routes for millions of VMs?
>


Nope. There is usually more than one PE serving those millions of VMs.
Each PE, just like in vanilla L3VPN, only holds those VPN routes which are
needed. It also only originates routes for its own CEs (read: the VMs
behind it).


>   If so, I don't understand why the scheme is claimed to be scalable.  A
>   solution that relies on millions of BGP-distributed host routes might be
>   expected to exhibit some scaling problems having to do with
>   routing/forwarding table size.  (Note that section 3.9 proposes to
>   distribute host routes not only to other DCs, but to "cloud user sites",
>   as well.)
>

This is getting interesting now. Am I supposed to argue with you, Eric,
that L3VPN is scalable? You know how it scales, and in this case, just as
in other work in the same working group [hint:
draft-ietf-l3vpn-end-systems], host routes are always distributed within
the DC.

Inter-DC or DC-to-user distribution, by contrast, always depends on
careful choice and, where possible, aggregation at the gateway.
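
To illustrate the aggregation point, here is a minimal Python sketch (not
from the draft) of a gateway collapsing the /32 host routes learned inside
the DC into covering prefixes before advertising them further. The function
name and the sample addresses are made up for illustration. It only merges
routes that are exactly contiguous; a real gateway would more likely
advertise a configured subnet aggregate.

```python
import ipaddress

def aggregate_host_routes(host_routes):
    """Collapse contiguous host routes into the shortest covering prefixes."""
    nets = [ipaddress.ip_network(r) for r in host_routes]
    return [str(n) for n in ipaddress.collapse_addresses(nets)]

# Hypothetical /32 host routes learned from PEs inside the DC.
inside_dc = ["10.1.1.0/32", "10.1.1.1/32", "10.1.1.2/32", "10.1.1.3/32",
             "10.1.2.7/32"]

# Four contiguous hosts become one /30; the isolated host stays a /32.
print(aggregate_host_routes(inside_dc))
```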


>   with host routes, but the Internet doesn't run on host routes because of
>   the scaling issues.
>

Well, we are not talking about global flooding of host routes, so this may
be a subtle difference when comparing it to the Internet.

Also, these days the PEs are a bit different from traditional routers,
especially in their data plane ;)


> - I wondered originally whether the intention is that host routes are
>   distributed only in the exception cases, where a VM moves off its
>   "native" subnet.  But the draft doesn't seem to say anything like that.


Perhaps it should. Good point.

>   discusses everything that might possibly break.  For instance, will DHCP
>   still work?
>
>
Yes, via DHCP relay or proxy features.


>   Probably the answer is going to be "anything that doesn't work any more
>   isn't needed in the DC environment".  Maybe the draft just needs to state
>   its applicability restrictions more clearly.
>

Well, it is hard to describe how to build an entire DC in one IETF
draft ... So only the points immediately relevant to the idea at hand are
included.


> - To provide good scaling, one needs to consider not only the number of VMs,
>   but the rate of movement.  How many VMs per second move from one DC to
>   another, how many VMs per second are created, how many destroyed?  These
>   rates will have considerable impact on the control plane.  This issue
>   isn't even mentioned in the draft.
>

I do not think that the BGP control plane is a problem here. As you know,
it can converge pretty fast, especially in the latest multithreaded
implementations.

The limiting factor here would be the VM data and state replication rate,
which IMHO would be a few orders of magnitude lower than the BGP
update/withdraw propagation rate.



>
> - If a PE originates a host route, I don't see anything in the draft that
>   will cause the host route to time out and be withdrawn if the host
>   disappears.  (There is discussion of what to do if the host shows up
>   somewhere else, but I didn't see any discussion of what to do if the host
>   just disappears altogether.)  Surely a scheme based on host routes for
>   movable hosts needs some sort of 'garbage collection'.
>


Again, the same as above. Orchestration will notice the VM going down. It
will instruct the PE (most likely via an API) to remove such routes.
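
A rough Python sketch of what I mean, with orchestration events driving
both origination and withdrawal. The class, the method names and the tuple
format are purely illustrative assumptions. They are not an API from the
draft or from any real orchestration system.

```python
class PE:
    """Toy PE that originates host routes only on orchestration events."""

    def __init__(self, name):
        self.name = name
        self.originated = {}  # host IP -> VRF name

    def vm_up(self, vrf, host_ip):
        # Orchestration reports a VM attached behind this PE: originate a /32.
        self.originated[host_ip] = vrf
        return ("announce", vrf, f"{host_ip}/32", self.name)

    def vm_down(self, host_ip):
        # Orchestration reports the VM is gone (or moved): withdraw the route.
        vrf = self.originated.pop(host_ip, None)
        if vrf is None:
            return None  # nothing was originated for this host
        return ("withdraw", vrf, f"{host_ip}/32", self.name)

pe1 = PE("PE1")
print(pe1.vm_up("tenant-a", "10.1.1.5"))
print(pe1.vm_down("10.1.1.5"))
```

With this model no timer-based garbage collection is needed on the PE: the
route lives exactly as long as orchestration says the VM does.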


> - The draft suggests that if a PE, say PE1, has originated a host route for
>   host H, and then PE1 sees a host route for H from another PE, say PE2,
>   that PE1 should try to figure out whether H is still local, and withdraw
>   the route if it concludes that H is no longer local.
>
>
I think this section of the draft may need some work. I must have missed
it. 



>   I believe this presumes that all VRFs have unique RDs; that should be
>   stated.  (Otherwise a route reflector might not forward all the routes.)
>
>   Suppose PE1 sees a host route for H from PE2, but PE1 then concludes that
>   H is still local.  Is the local route to be considered preferable?  Does
>   it install the BGP route from PE2, but not issue the proxy ARP responses?
>   The draft should state the procedures for this case.
>


I would not add any special handling or any new procedures for this case.
I would just treat it the way a regular L3VPN PE handles two paths for a
given net.

Choose the overall best path and install it in the data plane. It can be
local or it can be remote, depending on the best-path criteria.
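
As a heavily simplified illustration in Python: the sketch below compares
only LOCAL_PREF, then local origination, then AS_PATH length. That ordering
is an assumption made for this example. It is a small subset of the real
BGP decision process, and the "locally originated" step is a common
implementation behaviour rather than something the draft specifies.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Path:
    next_hop: str
    local_pref: int = 100
    as_path_len: int = 0
    locally_originated: bool = False

def best_path(paths):
    # Prefer higher LOCAL_PREF, then a locally originated path,
    # then the shorter AS_PATH. Real implementations have more steps.
    return min(paths, key=lambda p: (-p.local_pref,
                                     not p.locally_originated,
                                     p.as_path_len))

local = Path("local", locally_originated=True)
remote = Path("PE2", local_pref=200)

# The remote path wins here only because of its higher LOCAL_PREF.
print(best_path([local, remote]).next_hop)
```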


>   What if there is a local BGP route for PE2, (say, from a CE router), but
>   the BGP decision process chooses the remote route?
>


Then all other PEs will also choose the "remote route", hence no issue,
is there?



> - It seems to me that the scheme does not work at all if a single site is
>   attached to two PEs, UNLESS those PEs negotiate some sort of
>   primary/secondary relationship.
>


Two answers to that - excellent question:

* In the DC the most common case is that VMs are served by one PE. Yes, I
know this is, again, not what we are used to in the CE-PE WAN case. But
the focus in the DC is on making the compute node redundant, not the
single VMs running on it.

* However, if you did have a VM attached to two PEs (let's say over a
common LAN subnet), it would be a regular anycast VPN case where said PEs
advertise the host route independently. For return traffic and the L3VPN
side this should work just fine.

On the host/VM side it is just as the draft says: one could use VRRP to
select the master outbound PE if they are connected over a LAN, or any
other alternative (again orchestration :) or just periodic probing.


>   Suppose Site-1 has Host H-1, and attaches to PE-11 and PE-12.  Site-2 has
>   host H-2, and attaches to PE-2.  Suppose further that H-1 and H-2 have
>   addresses "in the same subnet".  PE-2 discovers the presence of H-2, and
>   so distributes a host route for it; PE-11 and PE-12 import this route.
>
>   Now H-1 sends an ARP request for H-2.  PE-11 and PE-12 both generate a
>   proxy response.  That by itself is probably enough to mess up the
>   communication from Site-1 to H-2.


Actually not .. this usually works :) 


>  But PE-11 and PE-12 will see each
>   other's proxy responses, and hence will both conclude that H-2 is local.
>   So they will both generate host routes for H-2


Well, I would not jump to that conclusion. The mere presence of a learned
entry in the ARP table should not trigger host route auto-generation.

In any case, even watching GARP may not be widely deployed. I think it
will mainly be the orchestration layer.

Best regards,
R.
