Erik Nordmark wrote:
>Francesco DiMambro wrote:
>
>>I benchmarked MDT vs. LSO and MDT was better on our card, but I attribute
>>that to limited hardware capability; if the hardware can go up to 1M LSO
>>then I'd expect LSO to overtake MDT. We're not there yet, hence the need
>>for MDT.
>
>With TCP, how large are the chunks you see passed down to the driver? It
>shouldn't pass more than about 3 MTUs' worth at a time in normal TCP
>operation. For LAN usage TCP will do more aggressive ACK suppression,
>hence you might see a larger multiple of MTUs in the LAN case. But 64k
>chunks of TCP make no sense (you mentioned 500 TCP packets in one
>request). With what network and application would that appear? It might
>be a bug in TCP...
>
>In any case, the benefit of going above 64 kbytes is probably small since
>there is per-byte overhead (copyin and copyout) and per-page overhead
>(IOMMU, sendfile).
>
>>In the case of UDP there's no LSO in Solaris, so the benchmark was against
>>the plain old one-packet-at-a-time model. With that, MDT was twice as fast,
>>but was limited by the receiver, which couldn't deal with the volume of
>>packets. (Working on that next.)
>
>The UDP MDT code only applies to fragmented packets, since that is the
>only case when a single write or sendto on a UDP socket results in
>multiple packets going out.
>
>Version 2.4.4 of netperf (I haven't checked other versions) defaults to
>a message size of 57344, which causes fragmentation. That was the
>benchmark you ran (netperf -t UDP_STREAM).
>Are you saying that a 57 kbyte message size on UDP is representative of
>some real applications? I certainly haven't seen such an application.
>
>NFS over UDP (back in the NFSv2 days) did default to an 8k read and
>write size, but with NFSv3 (ten years ago?) and now NFSv4, NFS over UDP
>is becoming part of history with less and less real usage.
>
>And UDP applications which operate over a wide area (DNS, video, voice
>applications) avoid fragmentation like the plague, since the retransmit
>unit being different from the loss unit causes severe performance
>degradation. See the "Fragmentation considered harmful" paper at e.g.
>ftp://gatekeeper.research.compaq.com/pub/DEC/WRL/research-reports/WRL-TR-87.3.pdf
>
>>The complexity in the driver is equal to the complexity of the
>>one-packet-at-a-time model. The difference is familiarity; in other
>>words, most people can write a one-packet-at-a-time Tx and layer LSO
>>on top of that. No brainer ;-).
>
>I wasn't just concerned about the complexity in the driver - I am
>concerned about the total system complexity caused by the MDT
>implementation. The amount of code that needs to know about M_MULTIDATA
>is scary, and in many cases there are different code paths to deal with
>it, which makes understanding, supporting, and bug fixing much more
>complex.
>
>Architecturally it makes more sense to have everything above GLD just
>view everything as TCP LSO. In the case where the hardware doesn't
>handle LSO, it is quite efficient to convert the LSO format to an "MDT
>format". By this I mean taking LSO's 'one TCP/IP header, one large
>payload' and turning it into 'multiple TCP/IP headers, separate
>payloads but on the same pages'. That means you'd get the performance
>benefit of doing DMA/IOMMU setup for the single large payload and the
>page with N TCP/IP headers.
>(Some refer to this as "soft LSO", but that term also includes the case
>when the DMA/IOMMU handling is operating on the individual, small
>packets, and here I am talking about amortizing DMA/IOMMU handling the
>same way as with MDT.)
>
>    Erik

The soft LSO approach the Neptune Linux driver takes is to advertise
itself as HW LSO capable. The LSO segment (for now, TCP) is DMA mapped
per page and then chopped into MSS-sized packets within the driver. The
relevant IP/TCP headers are manufactured in the driver. This has the
benefit that the stack doesn't differentiate between HW and SW LSO.
Also, per-page DMA mapping cuts down the number of mappings that would
have been needed if the stack had done the chopping. Our testing has
shown that it gives much better throughput and CPU utilization than the
case where the stack did the partitioning and sent down a list of
packets (GSO). The throughput approaches that of HW LSO and the CPU
utilization is close to it as well. The drawback is that it is driver
specific, and each driver implementing SW LSO has to reimplement it for
its specific HW. But then, the driver is emulating a HW functionality,
so that should be OK.
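To make the chopping step concrete, here is a rough sketch of the
header-manufacturing loop. This is hypothetical illustration code, not
the actual Neptune driver; all struct and function names are made up,
it works in host byte order for brevity, and checksum fixup plus the
real DMA descriptor setup are elided. It copies one template TCP/IP
header pair per MSS-sized chunk, advances the TCP sequence number,
gives each packet a fresh IP id and total length, and keeps FIN/PSH
only on the last segment:

/*
 * Hypothetical soft-LSO chopper (not the actual Neptune code).
 * A real driver would use htons()/htonl() and recompute the
 * IP and TCP checksums for each manufactured header.
 */
#include <stdint.h>

#define TCP_FIN 0x01
#define TCP_PSH 0x08

struct ip_hdr {                     /* simplified IPv4 header */
    uint8_t  ver_ihl, tos;
    uint16_t tot_len, id, frag_off;
    uint8_t  ttl, proto;
    uint16_t cksum;
    uint32_t saddr, daddr;
};

struct tcp_hdr {                    /* simplified TCP header */
    uint16_t sport, dport;
    uint32_t seq, ack;
    uint8_t  off, flags;
    uint16_t win, cksum, urgp;
};

struct lso_seg {                    /* one manufactured packet */
    struct ip_hdr  ip;
    struct tcp_hdr tcp;
    const uint8_t *payload;         /* points into the big mapped buffer */
    uint16_t       paylen;
};

/*
 * Chop one large TCP payload into MSS-sized segments, manufacturing
 * a header pair for each.  Returns the number of segments produced.
 */
static int
soft_lso_chop(const struct ip_hdr *ipt, const struct tcp_hdr *tcpt,
    const uint8_t *payload, uint32_t len, uint16_t mss,
    struct lso_seg *segs, int max_segs)
{
    uint32_t off = 0;
    int n = 0;

    while (off < len && n < max_segs) {
        uint16_t chunk = (uint16_t)((len - off > mss) ? mss : len - off);
        struct lso_seg *s = &segs[n];

        s->ip  = *ipt;                   /* copy the template headers */
        s->tcp = *tcpt;
        s->ip.tot_len = (uint16_t)(sizeof (s->ip) + sizeof (s->tcp) + chunk);
        s->ip.id      = (uint16_t)(ipt->id + n); /* fresh IP id per packet */
        s->tcp.seq    = tcpt->seq + off;         /* advance the sequence */
        if (off + chunk < len)           /* FIN/PSH only on the last one */
            s->tcp.flags &= (uint8_t)~(TCP_FIN | TCP_PSH);
        /* IP and TCP checksums would be recomputed here (elided). */

        s->payload = payload + off;      /* no copy; same mapped pages */
        s->paylen  = chunk;
        off += chunk;
        n++;
    }
    return (n);
}

The thing to notice is that the payload pointers all land in the same
already-mapped pages, so the per-page DMA mapping cost is paid once for
the whole burst; that is where the win over per-packet mapping comes
from.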
Just my 2 cents ...

matheos