On Mon, Sep 30, 2013 at 12:23 PM, Lucy yong <[email protected]> wrote:
> [Lucy] We should distinct flow forwarding and ECMP. This is about flow
> forwarding, not ECMP. Does NIC implementing NVO3 offloading mean VXLAN/NVGRE
> encap/decap and tunnel function or other as well? When implementing NVO3
> offloading, does that mean that hypervisor implementing vSwitching need to
> run on one CPU and all the packets need to pass this hypervisor first?
If the NIC has NVO3 offloading, the hypervisor will *not* need to interact with the traffic -- as long as the use of NVO3 on the VMs is within the design limits of the NIC's offloading implementation. The NIC itself would encapsulate / decapsulate. This is the advantage of NIC offloading -- removing the CPU expense of executing the hypervisor for encapsulation / decapsulation.

If the NIC has this capability, it can direct packets to the appropriate DMA queue (which can have affinity with a CPU) for the guest VM. In addition, the hypervisor would not need to be executed at all to receive packets (and maybe not even to transmit, depending on security needs). This provides ideal resource utilization.

> Entropy here is really not smart all the time. It is a useful option for
> some scenarios, but others will require a fixed L4 header per NVO3
> termination, so the appropriate VM/CPU can receive packets from the NIC
> efficiently, without executing a hypervisor at all.
> [Lucy] Flow entropy relates to ECMP LB process, does not apply to flow
> forwarding, i.e. flow entropy can't be used in flow forwarding. Thus, the
> thread title and discussed content are difference topics.

These are actually related, if the entropy information is carried in a part of the outer packet (like an L4 field) that may be used by NICs. I will explain a scenario below. My point is not really to highlight this specific case, but to highlight the fact that there is very, very little discussion of NIC/host interfaces on the NVO3 mailing list, and insufficient consideration given to this area. This might be because many of the posters are unfamiliar with this subject.

Imagine that there is no NIC offloading of NVO3, and a hypervisor uses 2 or more CPUs, CPU#1 and CPU#2, for its various functions, including NIC Rx/Tx.
Two different traffic flows arrive, called F and G. Both flows are destined for guest VM "A", which executes on one or more CPUs; for simplicity, call this CPU#3.

Suppose the NIC happens to deliver flow F to the hypervisor DMA queue for CPU#1, and flow G to the hypervisor DMA queue for CPU#2, and the guest VM also happens to receive these flows in the same bounce-buffer (the VM might only have one such buffer). Then, as packets are received in the order F-G-F-G-F-G, CPU#1 and CPU#2 may be flushing bounce-buffer descriptor rows from their L1 caches if the descriptors share cachelines (implementation-dependent). CPU#3 will also need to read these same cachelines, most likely causing traffic to the L3 cache (often the only memory interface among distant CPU cores on a single die). In modern CPUs this can mean dozens of additional clock cycles wasted per packet receipt, just waiting on the cache-coherency machinery to allow the memory to be read or written.

So the above situation means that having two CPU cores for receiving network traffic, performing hypervisor / vSwitch-based encapsulation & decapsulation, and bouncing traffic into the guest VMs could actually perform worse than using only one CPU core -- because the NIC might use flow entropy information carried in L4 headers in its hashing algorithm, while the guest VM may not have multiple Rx queues to take advantage of this. The same entropy information which benefits ECMP/LAG hashing in the datacenter network might hinder the NIC -> hypervisor -> guest VM interface, rather than help it, depending on a variety of implementation-specific issues.
For example, the whole problem above can be mitigated if the hypervisor -> guest VM packet descriptors do not span across cachelines; or if the NIC is configured to ignore L4 information in Rx DMA ring hashing; or if the L4 fields identify the guest VM instead of containing entropy from the inner flow; or if the guest VM has multiple different bounce-buffers for packet receipt, at least one per hypervisor CPU that might be handling Rx network traffic; etc.

Entropy data for LAG/ECMP is good and necessary for the datacenter network, but its effect on the end-hosts should be considered and discussed. Such discussion may be helpful to software folks, NIC vendors, etc. as they decide to implement NVO3.

--
Jeff S Wheeler <[email protected]>
Sr Network Operator / Innovative Network Concepts

_______________________________________________
nvo3 mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/nvo3
