Francesco DiMambro wrote:

> I benchmarked MDT v. LSO and MDT was better on our card, but I attribute
> that to limited hardware capability. If the hardware can go up to 1M LSO
> then I'd expect LSO to overtake MDT; we're not there yet, hence the need
> for MDT.

With TCP, how large are the chunks you see passed down to the driver? It 
shouldn't pass more than about 3 MTUs' worth at a time in normal TCP 
operation. For LAN usage TCP does more aggressive ACK suppression, hence 
you might see a larger multiple of MTUs in the LAN case. But >64k 
chunks of TCP make no sense (you mentioned 500 TCP packets in one 
request). With what network and application would that appear? It might 
be a bug in TCP...

In any case, the benefit of going above 64 kbytes is probably small: the 
per-chunk overhead is already well amortized at that size, while the 
per-byte overhead (copyin and copyout) and per-page overhead (IOMMU, 
sendfile) scale with the data and don't amortize any further.
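
A back-of-the-envelope sketch of that amortization argument (my own 
illustration, not from the measurements discussed here; the cost 
constants are made-up placeholders):

    #include <stdio.h>

    #define PER_CHUNK_NS  20000.0   /* hypothetical per-chunk setup cost */
    #define PER_BYTE_NS   0.5       /* hypothetical per-byte copy cost   */

    int main(void)
    {
        int sizes[] = { 8 * 1024, 64 * 1024, 1024 * 1024 };

        for (int i = 0; i < 3; i++) {
            /* setup cost amortizes over the chunk; copy cost does not */
            double setup = PER_CHUNK_NS / sizes[i];

            printf("%8d bytes: %.3f ns/byte setup + %.1f ns/byte copy "
                "= %.3f ns/byte\n",
                sizes[i], setup, PER_BYTE_NS, setup + PER_BYTE_NS);
        }
        return (0);
    }

Going from 8k to 64k chunks cuts the per-byte total several-fold; going 
from 64k to 1M helps far less, because the per-byte copy cost dominates.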

> In the case of UDP there's no LSO in Solaris, so the benchmark was against
> the plain old one-packet-at-a-time model. With that, MDT was twice as fast,
> but was limited by the receiver, which couldn't deal with the volume of
> packets. (Working on that next.)

The UDP MDT code only applies to fragmented packets since that is the 
only case when a single write or sendto on a UDP socket results in 
multiple packets going out.

Version 2.4.4 of netperf (I haven't checked other versions) defaults to 
a message size of 57344, which causes fragmentation. That was the 
benchmark you ran (netperf -t UDP_STREAM).
Are you saying that a 57 kbyte message size on UDP is representative of 
some real application? I certainly haven't seen such an application.
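
For reference, a quick sketch (mine, not from the benchmark) of how many 
IP fragments such a single send generates on Ethernet:

    #include <stdio.h>

    #define IP_HDR   20      /* IPv4 header, no options */
    #define UDP_HDR  8
    #define MTU      1500    /* Ethernet */

    int main(void)
    {
        int msg = 57344;                     /* netperf 2.4.4 default */
        int payload = msg + UDP_HDR;         /* UDP header is IP payload */
        int per_frag = (MTU - IP_HDR) & ~7;  /* offsets are 8-byte units */
        int frags = (payload + per_frag - 1) / per_frag;

        printf("%d-byte send -> %d IP fragments\n", msg, frags);
        return (0);
    }

That works out to 39 fragments per send, and losing any one of them 
discards the whole datagram, which is exactly the fragmentation problem 
mentioned below.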

NFS over UDP (back in the NFSv2 days) did default to an 8k read and 
write size, but with NFSv3 (ten years ago?) and now NFSv4, NFS over UDP 
is becoming part of history with less and less real usage.

And UDP applications which operate over a wide area (DNS, video, voice 
applications) avoid fragmentation like the plague, since the retransmit 
unit being different from the loss unit causes severe performance 
degradation. See the "Fragmentation considered harmful" paper at e.g., 
ftp://gatekeeper.research.compaq.com/pub/DEC/WRL/research-reports/WRL-TR-87.3.pdf

> The complexity in the driver is equal to the complexity of the
> one-packet-at-a-time model. The difference is familiarity; in other words,
> most people can write a one-packet-at-a-time Tx and layer LSO on top of
> that, no brainer ;-).

I wasn't just concerned about the complexity in the driver - I am 
concerned about the total system complexity caused by the MDT 
implementation. The amount of code that needs to know about M_MULTIDATA 
is scary, and in many cases there are separate code paths to deal with 
it, which makes understanding, supporting, and fixing bugs much more 
complex.

Architecturally it makes more sense to have everything above GLD just 
view everything as TCP LSO. In the case where the hardware doesn't 
handle LSO, it is quite efficient to convert the LSO format to an "MDT 
format". By this I mean turning LSO's 'one TCP/IP header, one large 
payload' into 'multiple TCP/IP headers, separate payloads but on the 
same pages'. That way you'd still get the performance benefit of doing 
the DMA/IOMMU setup once for the single large payload and once for the 
page holding the N TCP/IP headers.
(Some refer to this as "soft LSO", but that term also covers the case 
where the DMA/IOMMU handling operates on the individual, small packets; 
here I am talking about amortizing the DMA/IOMMU handling the same way 
as with MDT.)

    Erik