Hi Erik

Erik Nordmark wrote:
> Francesco DiMambro wrote:
>
>> I benchmarked MDT v LSO and MDT was better on our card, but I attribute
>> that to limited hardware capability; if the hardware can go up to 1M
>> LSO then I'd expect LSO to overtake MDT. We're not there yet, hence
>> the need for MDT.
>
> With TCP how large chunks do you see passed down to the driver? It 
> shouldn't pass more than about 3 MTUs worth at a time in the normal 
> TCP operation. For LAN usage TCP will do more aggressive ACK 
> suppression hence you might see a larger multiple of MTUs in the LAN 
> case. But >64k chunks of TCP makes no sense (you mentioned 500 TCP 
> packets in one request). With what network and application would that 
> appear? It might be a bug in TCP...
>
> In any case, the benefit of going above 64kbyte is probably small 
> since there is per-byte overhead (copyin and copyout) and per-page 
> overhead (IOMMU, sendfile).
>
It's back-to-back 10G systems with simple netperf, no sendfile. If it's a bug,
then you share the bug with M$, which in NDIS 6.x sends a similar number of
packets to the driver in one shot. Plus LSO > 64K, which is becoming
available, also says it's reasonable to burst large amounts of data.
It's the larger copyin, right? Because we're talking Tx only here?
Doesn't sockfs trap only once on a system call and copyin the complete
message or the socket buffer, whichever is smaller, or in the zero-copy case
map the whole buffer into the kernel? Then the difference is calls to the
driver: MDT does 1, LSO does 1M/64K = 16 stack traversals. The IOMMU setup
is down in the driver, and it's much cheaper to lock down 1M for DMA and
amortize the cost across all the packets than to do one packet at a time;
the same applies to 64K LSOs.
>> In the case of UDP there's no LSO in Solaris so the benchmark was 
>> against
>> plain old one packet at a time model. With that MDT was twice as 
>> fast, but
>> was limited by the receiver which couldn't deal with the volume of 
>> packets.
>> (Working on that next.)
>
> The UDP MDT code only applies to fragmented packets since that is the 
> only case when a single write or sendto on a UDP socket results in 
> multiple packets going out.
>
> Version 2.4.4 of netperf (I haven't checked other versions) defaults 
> to a message size of 57344, which causes fragmentation. That was the 
> benchmark you ran (netperf -t UDP_STREAM).
> Are you saying that a 57kbyte message size on UDP is representative of 
> some real applications? I certainly haven't seen such an application.
It's benchmarks that win deals... and by the note above, what's the need for
UDP LSO? There's another benchmark, RFC 2544 for UDP routing, that is
turning out to be a big deal also. Still, since we're on the subject, here
is how MDT could be used more effectively for UDP.
When I was at Sun we talked to a video transfer company, I think their name
is Kasenna. They wanted an interface to the driver to burst out multiple
small UDP video packets, typically 512 bytes apiece. They found MDT
interesting because it could do that: it is a simple data structure that
facilitates that type of application. They wanted to bypass the Sun stack,
go straight to the driver, and do their own UDP. MDT was new then and we
(Sun) decided to hold off on that; nevertheless there are applications for
fast UDP which LSO cannot address but MDT could have.
> NFS over UDP (back in the NFSv2 days) did default to an 8k read and 
> write size, but with NFSv3 (ten years ago?) and now NFSv4, NFS over 
> UDP is becoming part of history with less and less real usage.
That covers past applications; there are still present-day uses of UDP in
the stock trading arena. Those won't benefit from LSO, but could benefit
from packing messages to the driver in one MDT request. M$ already does
this in all versions of NDIS, but without the addition of amortizing the
cost of DMA setup, which just means Solaris has the edge with its
multi-packet Tx software interface, for as long as it lasts.
>
> And UDP applications which operate over a wide area (DNS, video, voice 
> applications) avoid fragmentation as the plague since the retransmit 
> unit being different than the loss unit causes severe performance 
> degradation. See the "Fragmentation considered harmful" paper at e.g.,
I know, and LSO is useless there as well, so you have to be at least open
to an alternative which allows multiple packets to be sent to the driver.
You can keep trying to save an instruction here and there on the
one-packet-at-a-time model, but you have to see that will be coming to an
end soon. Then what?
> ftp://gatekeeper.research.compaq.com/pub/DEC/WRL/research-reports/WRL-TR-87.3.pdf
>  
>
>
>> The complexity in the driver is equal to the complexity of the one 
>> packet at
>> a time model. The difference is familiarity, in other words most 
>> people can
>> write a one packet at a time Tx, and layer on top of that LSO, no 
>> brainer ;-).
>
> I wasn't just concerned about the complexity in the driver - I am 
> concerned about the total system complexity caused by the MDT 
> implementation.  The amount of code that needs to know about 
> M_MULTIDATA is scary, and in many cases there are different code paths 
> to deal with those which makes understanding, supporting, and bug 
> fixing much more complex.
Can't the attention then move away from killing it toward understanding it
and addressing those concerns? I'm not disagreeing about the complexity,
but I'm finding it hard to buy that as a reason to get rid of it.
>
> Architecturally it makes more sense to have everything about GLD just 
> view everything as TCP LSO. In the case the hardware doesn't handle 
> LSO it is quite efficient to convert the LSO format to an "MDT 
> format". By this I mean take LSO's 'one TCP/IP header, one large 
> payload' into 'multiple TCP/IP headers, separate payloads but on the 
> same pages'. That means you'd get the performance benefit of doing 
> DMA/IOMMU setup for the single large payload and page with N TCP/IP 
> headers.
It's not obvious that there's an architectural weakness in what exists now,
just the general claim of stack complexity, which seems to come down to
implementation. The driver can support an MDT interface, and the stacks can
gradually be enhanced to take advantage of it, with no driver change for
those already implementing MDT.
This is in line with GLD principles. LSO continues to exist and gets
enhanced by hardware vendors. As a driver you register the capability you
want/support.
> (Some refer to this as "soft LSO", but that term also includes the 
> case when the DMA/IOMMU handling is operating on the individual, small 
> packets and here I am talking about amortizing DMA/IOMMU handling the 
> same way as with MDT.)
That's taken for granted in both LSO and MDT.

    thanks
    Frank
>
>    Erik

_______________________________________________
networking-discuss mailing list
[email protected]
