Hi Tom,

On Mon, 2014-03-24 at 14:33 -0700, Tom Herbert wrote:
> On Mon, Mar 24, 2014 at 10:34 AM, Erik Nordmark <[email protected]> wrote:
> > On 3/14/14 8:52 PM, Zhou, Han wrote:
> >>
> >>
> >> Tom, the point of this draft is that the "last possible point in the
> >> stack" can be pushed to the remote end-point of the VXLAN tunnel. If
> >> the remote is an hypervisor, this GSO is terminated without actual
> >> work: the receiving hypervisor simply delivers the large packet to
> >> receiving guest.
> >
> > Han,
> >
> > Are you saying that the sending VM and vswitch will send a large packet
> > (e.g., 32k) over UDP, and this will be delivered to the receiving vSwitch
> > and VM as one large packet? That certainly makes the work on the VMs a lot
> > less, hence I can understand that you see performance improvements.
> >
> > However, that would result in IP fragmentation of that large UDP/VXLAN
> > packet AFAICT.
> >
> That would be UFO. It seems more likely that you'd want to do TSO/LRO
> style L4 segmentation/reassembly.
>
UFO would be ideal, but it is not available in most current NICs, and it
is not the mechanism described in this draft. This proposal is a pure
software solution that requires no hardware offloading. An implementation
can choose to use hardware UFO if it is available, and otherwise fall
back to the mechanism proposed here (VXLAN-SOE).

> It's still not clear to me on why the MTU would need to be on the
> wire, offload mechanisms already work now without that information.
> Also, for packets going from one VM to another within a host moving
> jumbo packets could be directly linked into the TSO/LRO mechanisms.
>

If the remote VTEP is a gateway rather than a hypervisor, it requires
the overlay MTU information for re-segmentation, so that the correct MTU
(which may be the result of path-MTU negotiation) is applied. As
mentioned in the draft (section 2.4), the gateway has three choices:

1. re-segment in gateway software
2. offload to the NIC hardware
3. offload to the next hop, if it is a tunnel transport (e.g. VXLAN-GPE)
   and the tunnel protocol supports offloading too (e.g. also VXLAN-SOE)

Best regards,
Han

>
> > The IETF has some experience with protocols where the loss unit is smaller
> > than the retransmission unit, and this results in very poor performance
> > under packet loss due to the loss of a single, small unit resulting in the
> > retransmission of a large unit (the 32k packet, which might be a TCP, SCTP,
> > etc packet i.e. a reliable protocol with retransmissions.)
> >
> > Regards,
> > Erik
> >
> >
> >> This results in a huge performance gain, especially in the receiving side
> >> because the number of packets being handled are much smaller. The
> >> speed up is 2x - 3x even when the physical network still transmits small
> >> (1514 bytes) packets with IP fragmentation. If jumbo frames used in
> >> physical network the speedup will be boosted even higher. I would like
> >> to share more data of the prototype if it is of interest.
> >>
> >> So this is not a local mechanism, but need agreement between end-points.
> >> In a setup like:
> >> Hypervisor A <-- Hypervisor B --> Gateway
> >> On hypervisor B the VTEP treat both remote VTEPs the same way: it fills
> >> segmentation-offloading information for GSO packets. Hypervisor A
> >> optionally checks such information to understand that this is a valid
> >> large packet and don't drop it even its size is bigger than the guest's
> >> virtual interface. But this information is critical to the Gateway: it
> >> has to perform the real segmentation if the packet is being forwarded
> >> to physical networks.
> >> And this is why we need the metadata in the on-the-wire protocol.
> >>
> >>> - I believe this would conflict with the proposal to add a protocol
> >>> field to the VXLAN header. Overloading one field in a fixed header is
> >>> not an adequate substitute for a truly extensible header. In the best
> >>> case we could only use one or the other functionality in a given
> >>> packet. In the worse case, overloading opens the door to backwards
> >>> compatibility issues and the potential for misinterpretation of
> >>> fields.
> >>>
> >> I agree that we'd better avoid field overloading. But as stated in
> >> section 3, it is not conflict with VXLAN-gpe, because when segmentation
> >> offloading is enabled the encapsulated packet should always be Ethernet,
> >> and in such case prototype is not needed. But you reminded me that, it
> >> should be defined clearly that P bit specified by VXLAN-gpe MUST be set
> >> to 0 when S bit is 1. We will address that in version 01.
> >>
> >> Or if you know any real (or potential) scenarios of conflict between
> >> VXLAN-soe and VXLAN-gpe, please kindly point out and we can consider
> >> using the remaining space in the header instead of overloading it.
> >>
> >>> Tom
> >>>
> >>>> So this is a practical yet generic proposal, which extends the
> >>>> offloading concept
> >>>> to from kernel stacks to remote end-points of overlay networks.
> >>>>
> >>>> The metadata for offloading is very similar to STT. There difference is
> >>>> that:
> >>>> 1. it doesn’t add fake TCP header to utilize NIC TSO.
> >>>> 2. it doesn't include helper fields - just to save the limited VXLAN
> >>>>    header space for other possible purpose in the future.
> >>>> 3. VXLAN is widely adopted and this is only a minor extension backward
> >>>>    compatible
> >>>>
> >>>> Based on this, it is highly recommended to add segmentation metadata in
> >>>> VXLAN header as proposed in this draft.
> >>>>
> >>>> Any comments are appreciated!
> >>>>
> >>>> Best regards,
> >>>> Han Zhou
> >>>>
> >>>> -----Original Message-----
> >>>> From: [email protected] [mailto:[email protected]]
> >>>> Sent: Thursday, March 13, 2014 10:29 PM
> >>>> To: Zhou, Han; Li, Chengyuan; Li, Chengyuan; Zhou, Han
> >>>> Subject: New Version Notification for draft-zhou-li-vxlan-soe-00.txt
> >>>>
> >>>>
> >>>> A new version of I-D, draft-zhou-li-vxlan-soe-00.txt
> >>>> has been successfully submitted by Han Zhou and posted to the
> >>>> IETF repository.
> >>>>
> >>>> Name: draft-zhou-li-vxlan-soe
> >>>> Revision: 00
> >>>> Title: Segmentation Offloading Extension for VxLAN
> >>>> Document date: 2014-03-13
> >>>> Group: Individual Submission
> >>>> Pages: 7
> >>>> URL: http://www.ietf.org/internet-drafts/draft-zhou-li-vxlan-soe-00.txt
> >>>> Status: https://datatracker.ietf.org/doc/draft-zhou-li-vxlan-soe/
> >>>> Htmlized: http://tools.ietf.org/html/draft-zhou-li-vxlan-soe-00
> >>>>
> >>>>
> >>>> Abstract:
> >>>> Segmentation offloading is nowadays common in network stack
> >>>> implementation and well supported by para-virtualized network device
> >>>> drivers for virtual machine (VM)s. This draft describes an extension
> >>>> to Virtual eXtensible Local Area Network (VXLAN) so that segmentation
> >>>> can be decoupled from physical/underlay networks and offloaded
> >>>> further to the remote end-point thus improving data-plane performance
> >>>> for VMs running on top of overlay networks.
> >>>>
> >>>>
> >>>> Please note that it may take a couple of minutes from the time of
> >>>> submission until the htmlized version and diff are available at
> >>>> tools.ietf.org.
> >>>>
> >>>> The IETF Secretariat
> >>>>
> >>>> _______________________________________________
> >>>> nvo3 mailing list
> >>>> [email protected]
> >>>> https://www.ietf.org/mailman/listinfo/nvo3
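
As a rough illustration of choice 1 above (re-segment in gateway
software), here is a minimal Python sketch. It is not taken from the
draft: the function name `resegment`, the `mss` parameter (standing in
for the per-segment payload size a gateway would derive from the overlay
MTU metadata) and the 1448-byte value are assumptions made only for this
example. Header rewriting (IP length, checksums, TCP flags) is left out.

```python
from typing import List, Tuple


def resegment(payload: bytes, first_seq: int, mss: int) -> List[Tuple[int, bytes]]:
    """Split one large inner TCP payload into (sequence number, chunk) pairs.

    `mss` is a hypothetical parameter: the payload size per segment that a
    gateway would compute from the overlay MTU carried with the packet.
    """
    if mss <= 0:
        raise ValueError("mss must be positive")
    segments = []
    for offset in range(0, len(payload), mss):
        chunk = payload[offset:offset + mss]
        # TCP sequence numbers wrap around at 2**32.
        segments.append(((first_seq + offset) % 2**32, chunk))
    return segments


if __name__ == "__main__":
    big = bytes(32 * 1024)  # a 32k payload, the size used as an example in this thread
    segs = resegment(big, first_seq=1000, mss=1448)
    print(len(segs), len(segs[-1][1]))  # prints "23 912"
```

With these assumed numbers, one 32k payload becomes 22 full segments plus
a 912-byte tail. A receiving hypervisor would skip this step and hand the
large packet to the guest as is.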
