Jeff,

To implement NVO3 offloading, a NIC must be able to support VXLAN/NVGRE/STT 
encap/decap and tunnel termination, i.e. inspect the protocol type and 
demultiplex on TCP/UDP ports first. The flow entropy for these encapsulations 
is embedded in the source port. The NIC does not have to use it and can use 
the inner header directly for flow steering. 
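
To make the source-port point concrete, here is a minimal sketch (names are 
hypothetical, IPv4 assumed, CRC32 standing in for whatever hash the hardware 
actually uses) of how an ingress tunnel endpoint might derive the VXLAN UDP 
source port from the inner 5-tuple, keeping it in the ephemeral range as 
RFC 7348 suggests:

```python
import socket
import struct
import zlib

def entropy_src_port(src_ip, dst_ip, proto, sport, dport,
                     lo=49152, hi=65535):
    """Hash the inner flow's 5-tuple into the outer UDP source port.

    Packets of the same inner flow always get the same source port, so
    underlay ECMP keeps the flow on one path while different flows spread.
    """
    # Pack the inner 5-tuple into a fixed-width key (IPv4 assumed).
    key = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip) +
           struct.pack("!BHH", proto, sport, dport))
    h = zlib.crc32(key)  # stand-in for the NIC/switch hardware hash
    return lo + (h % (hi - lo + 1))
```

The same idea applies to NVGRE (entropy in the GRE FlowID/key bits) and STT; 
only the field carrying the hash differs.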

Your point that ECMP may hinder the NIC -> hypervisor -> guest VM interface 
design is true. No free lunch. At a minimum, the ingress tunnel endpoint needs 
to generate the flow entropy and encode it on the packets, and the egress 
endpoint needs to strip it off. However, ECMP support is essential to the 
underlying network (see the nvo3 data plane requirements). It is just a matter 
of what we gain versus where we spend the money. 
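
For illustration, a rough sketch (hypothetical names, CRC32 as a stand-in 
hash) of the underlay side: a switch with several equal-cost next hops hashes 
the outer header, so per-flow entropy in the outer source port is what spreads 
tunneled flows across members:

```python
import socket
import struct
import zlib

def ecmp_next_hop(src_ip, dst_ip, proto, sport, dport, next_hops):
    """Pick an equal-cost next hop from the outer 5-tuple.

    With per-flow entropy carried in the outer UDP source port (sport),
    different tunneled flows between the same pair of tunnel endpoints
    hash to different members; without it they all pick the same one.
    """
    key = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip) +
           struct.pack("!BHH", proto, sport, dport))
    return next_hops[zlib.crc32(key) % len(next_hops)]
```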

I do not work on NIC design. Just throwing in my 2 cents.

Regards,
Lucy

-----Original Message-----
From: Jeff Wheeler [mailto:[email protected]] 
Sent: Monday, September 30, 2013 12:02 PM
To: Lucy yong
Cc: Lizhong Jin; [email protected]
Subject: Re: [nvo3] LAG/ECMP load-balancing problems facing overlay networks

On Mon, Sep 30, 2013 at 12:23 PM, Lucy yong <[email protected]> wrote:
> [Lucy] We should distinguish flow forwarding from ECMP. This is about flow 
> forwarding, not ECMP. Does a NIC implementing NVO3 offloading mean VXLAN/NVGRE 
> encap/decap and the tunnel function, or other functions as well? When 
> implementing NVO3 offloading, does that mean that the hypervisor implementing 
> vSwitching needs to run on one CPU and all packets need to pass through this 
> hypervisor first?

If the NIC has NVO3 offloading, the hypervisor will *not* need to interact with 
the traffic -- as long as the use of NVO3 on the VMs is within the design 
limits of the NIC's offloading implementation.  The NIC itself would 
encapsulate / decapsulate.  This is the advantage of NIC offloading -- removing 
the CPU expense of executing the hypervisor for encapsulation / decapsulation.

If the NIC has this capability, it would be able to direct packets to the 
appropriate DMA queue (which can have affinity with a CPU) for the guest VM.  
In addition, the hypervisor would not need to be executed at all to receive 
packets (and maybe not even to transmit, depending on security needs).  This 
provides ideal resource utilization.
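
A toy sketch of that steering step (an RSS-style indirection table is my 
assumption here; queue count and the queue-to-CPU mapping are made up for 
illustration): the NIC hashes the packet, the hash indexes a table of Rx DMA 
queues, and each queue has affinity with one CPU.

```python
import zlib

# Hypothetical 8-queue NIC.  An RSS-style indirection table maps the
# packet hash to an Rx DMA queue; each queue has affinity with one CPU
# (queue i -> CPU i here, for simplicity).
NUM_QUEUES = 8
INDIRECTION = [i % NUM_QUEUES for i in range(128)]

def rx_queue_for(outer_key: bytes) -> int:
    """Select the Rx DMA queue (and thus the CPU) for a received packet."""
    return INDIRECTION[zlib.crc32(outer_key) % len(INDIRECTION)]
```

With NVO3 offload, the key fed to this hash could be taken from the inner 
header (or from an L4 field that identifies the guest VM), so packets land on 
the queue owned by the right VM's CPU without the hypervisor touching them.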

> Entropy here is really not smart all the time.  It is a useful option for 
> some scenarios, but others will require a fixed L4 header per NVO3 
> termination, so the appropriate VM/CPU can receive packets from the NIC 
> efficiently, without executing a hypervisor at all.
> [Lucy] Flow entropy relates to the ECMP LB process and does not apply to flow 
> forwarding, i.e. flow entropy can't be used in flow forwarding. Thus, the 
> thread title and the discussed content are different topics.

These are actually related, if the entropy information is carried in a part of 
the outer-packet (like L4 field) that may be used by NICs.

I will explain a scenario below.  My point is not really to highlight this 
specific case, but to highlight the fact that there is very, very little 
discussion in the area of NIC/host interfaces on the NVO3 mailing list; and 
insufficient consideration given to this area.  This might be because many of 
the posters are unfamiliar with this subject.


Imagine that there is no NIC offloading of NVO3, and a hypervisor uses
2 or more CPUs, CPU#1 and CPU#2, for its various functions, including NIC 
Rx/Tx.  Two different traffic flows arrive, called F and G, and these flows are 
both destined for guest VM "A" which executes on one or more CPUs, for 
simplicity, call this CPU#3.

If the NIC happens to deliver flow F to the hypervisor DMA queue for CPU#1, and 
flow G to the hypervisor DMA queue for CPU#2, and also, the guest VM happens to 
receive these flows in the same bounce-buffer (the VM might only have one such 
buffer) then, as packets are received in the order F-G-F-G-F-G the CPU#1 and 
CPU#2 may be flushing bounce-buffer descriptor rows from their L1 caches if the 
descriptors share cachelines (implementation dependent.)  CPU#3 will also need 
to be reading these same cachelines, most likely causing traffic to the
L3 cache (often the only memory interface among distant CPU cores on a single 
die.)  In modern CPUs this can mean dozens of additional cycles wasted per 
packet received, just waiting on the cache coherency machinery to allow memory 
to be read or written.

So the above situation means that having two CPU cores for receiving network 
traffic, performing hypervisor / vSwitch-based encapsulation & decapsulation, 
and bouncing traffic into the guest VMs could actually perform worse than using 
only one CPU core -- because the NIC might use flow entropy information which 
is carried in L4 headers and used in its hashing algorithm -- but the guest VM 
may not have multiple Rx queues to take advantage of this.


The same entropy information which benefits ECMP/LAG hashing in the datacenter 
network might hinder the NIC -> hypervisor -> guest VM interface, rather than 
help it, depending on a variety of implementation-specific issues.  For 
example, the whole problem above can be mitigated if the hypervisor -> guest VM 
packet descriptors do not span across cachelines; or if the NIC is configured 
to ignore L4 information in Rx DMA ring hashing, or if the L4 fields identify 
the guest VM instead of containing entropy from the inner-flow, or if the guest 
VM has multiple different bounce-buffers for packet receipt, at least one per 
hypervisor CPU that might be handling Rx network traffic, etc.
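
The first mitigation (descriptors that do not span or share cachelines) is 
just layout arithmetic; a small sketch, with the 64-byte line size and 
16-byte descriptor size being assumptions for illustration:

```python
CACHELINE = 64  # bytes; typical on current x86/ARM cores

def pad_to_cacheline(desc_size: int) -> int:
    """Round a descriptor size up so each descriptor owns whole cachelines."""
    return -(-desc_size // CACHELINE) * CACHELINE

def shares_cacheline(off_a: int, off_b: int, desc_size: int) -> bool:
    """Do descriptors at off_a and off_b touch any common cacheline?"""
    lines = lambda off: set(range(off // CACHELINE,
                                  (off + desc_size - 1) // CACHELINE + 1))
    return bool(lines(off_a) & lines(off_b))

# With 16-byte descriptors packed back to back, neighbouring descriptors
# share a line -- so CPU#1 and CPU#2 writing adjacent ring entries bounce
# that line between their caches.  Padding each descriptor to 64 bytes
# gives every entry its own line, at the cost of ring memory.
```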

Entropy data for LAG/ECMP is good and necessary for the datacenter network, but 
its effect on the end-hosts should be considered and discussed.  Such 
discussion may be helpful to software folks, NIC vendors, etc. as they decide 
to implement NVO3.
--
Jeff S Wheeler <[email protected]>
Sr Network Operator  /  Innovative Network Concepts
_______________________________________________
nvo3 mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/nvo3
