On Mon, Sep 30, 2013 at 12:23 PM, Lucy yong <[email protected]> wrote:
> [Lucy] We should distinct flow forwarding and ECMP. This is about flow
> forwarding, not ECMP. Does NIC implementing NVO3 offloading mean VXLAN/NVGRE
> encap/decap and tunnel function or other as well? When implementing NVO3
> offloading, does that mean that hypervisor implementing vSwitching need to
> run on one CPU and all the packets need to pass this hypervisor first?
If the NIC has NVO3 offloading, the hypervisor will *not* need to interact with the traffic -- as long as the use of NVO3 on the VMs is within the design limits of the NIC's offloading implementation. The NIC itself would encapsulate / decapsulate. This is the advantage of NIC offloading -- removing the CPU expense of executing the hypervisor for encapsulation / decapsulation.

If the NIC has this capability, it can direct packets to the appropriate DMA queue (which can have affinity with a CPU) for the guest VM. In addition, the hypervisor would not need to be executed at all to receive packets (and maybe not even to transmit, depending on security needs). This provides ideal resource utilization.

> Entropy here is really not smart all the time. It is a useful option for
> some scenarios, but others will require a fixed L4 header per NVO3
> termination, so the appropriate VM/CPU can receive packets from the NIC
> efficiently, without executing a hypervisor at all.
> [Lucy] Flow entropy relates to ECMP LB process, does not apply to flow
> forwarding, i.e. flow entropy can't be used in flow forwarding. Thus, the
> thread title and discussed content are difference topics.

These are actually related, if the entropy information is carried in a part of the outer packet (like an L4 field) that may be used by NICs. I will explain a scenario below. My point is not really to highlight this specific case, but to highlight the fact that there is very, very little discussion of NIC/host interfaces on the NVO3 mailing list, and insufficient consideration given to this area. This might be because many of the posters are unfamiliar with this subject.

Imagine that there is no NIC offloading of NVO3, and a hypervisor uses 2 or more CPUs, CPU#1 and CPU#2, for its various functions, including NIC Rx/Tx.
Two different traffic flows arrive, called F and G. Both flows are destined for guest VM "A", which executes on one or more CPUs; for simplicity, call this CPU#3.

Suppose the NIC happens to deliver flow F to the hypervisor DMA queue for CPU#1, and flow G to the hypervisor DMA queue for CPU#2, and the guest VM also happens to receive these flows in the same bounce-buffer (the VM might only have one such buffer). Then, as packets are received in the order F-G-F-G-F-G, CPU#1 and CPU#2 may be flushing bounce-buffer descriptor rows from their L1 caches if the descriptors share cachelines (implementation-dependent). CPU#3 will also need to read these same cachelines, most likely causing traffic to the L3 cache (often the only memory interface among distant CPU cores on a single die). In modern CPUs this can mean dozens of additional clock cycles wasted per packet receipt, just waiting on the cache-coherency machinery to allow the memory to be read or written.

So the above situation means that having two CPU cores for receiving network traffic, performing hypervisor / vSwitch-based encapsulation & decapsulation, and bouncing traffic into the guest VMs could actually perform worse than using only one CPU core -- because the NIC might use flow entropy information carried in L4 headers in its hashing algorithm, while the guest VM may not have multiple Rx queues to take advantage of this. The same entropy information which benefits ECMP/LAG hashing in the datacenter network might hinder the NIC -> hypervisor -> guest VM interface, rather than help it, depending on a variety of implementation-specific issues.
For example, the whole problem above can be mitigated if the hypervisor -> guest VM packet descriptors do not span across cachelines; or if the NIC is configured to ignore L4 information in Rx DMA ring hashing; or if the L4 fields identify the guest VM instead of containing entropy from the inner flow; or if the guest VM has multiple different bounce-buffers for packet receipt, at least one per hypervisor CPU that might be handling Rx network traffic; etc.

Entropy data for LAG/ECMP is good and necessary for the datacenter network, but its effect on the end-hosts should be considered and discussed. Such discussion may be helpful to software folks, NIC vendors, etc. as they decide to implement NVO3.

--
Jeff S Wheeler <[email protected]>
Sr Network Operator / Innovative Network Concepts

_______________________________________________
nvo3 mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/nvo3
