On 08.12.2011 16:34, Luigi Rizzo wrote:
On Fri, Dec 09, 2011 at 12:11:50AM +1100, Lawrence Stewart wrote:
On 12/08/11 05:08, Luigi Rizzo wrote:
...
I ran a bunch of tests on the ixgbe (82599) using RELENG_8 (which
seems slightly faster than HEAD), with MTU=1500 and various
combinations of card capabilities (hwcsum, tso, lro), different
window sizes, and interrupt mitigation settings.

The default interrupt mitigation latency is 16us; l=0 means no interrupt
mitigation. "lro" is the software implementation of LRO (tcp_lro.c),
"hwlro" is the hardware one (on the 82599). Using a window of 100 Kbytes
seems to give the best results.

Summary:

[snip]

- enabling software lro on the transmit side actually slows
   down the throughput (4-5 Gbit/s instead of 8.0).
   I am not sure why (perhaps acks are delayed too much?).
   Adding a couple of lines in tcp_lro to reject
   pure acks seems to have a much better effect.

The tcp_lro patch below might actually be useful for
other cards as well.

--- tcp_lro.c   (revision 228284)
+++ tcp_lro.c   (working copy)
@@ -245,6 +250,8 @@

         ip_len = ntohs(ip->ip_len);
         tcp_data_len = ip_len - (tcp->th_off << 2) - sizeof(*ip);
+        if (tcp_data_len == 0)
+                return -1;      /* no coalescing of pure acks */


         /*

There is a bug with our LRO implementation (first noticed by Jeff
Roberson) that I started fixing some time back but dropped the ball on.
The crux of the problem is that we currently only send an ACK for the
entire LRO chunk instead of all the segments contained therein. Given
that most stacks rely on the ACK clock to keep things ticking over, the
current behaviour kills performance. It may well be the cause of the
performance loss you have observed.
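To put a rough number on the ack reduction involved, here is a tiny
illustrative calculation (not from the WIP patch): it compares the acks a
non-LRO receiver would emit under the RFC 1122 "ack at least every second
full-sized segment" rule with the single ack sent per coalesced chunk.
The mss and chunk values are just examples.

#include <stdio.h>

int
main(void)
{
	unsigned mss = 1448;		/* MTU 1500, timestamps enabled */
	unsigned chunk = 5 * mss;	/* ~4-5 segments coalesced per chunk */

	unsigned acks_no_lro = (chunk / mss + 1) / 2;	/* delayed acks */
	unsigned acks_lro = 1;		/* one ack for the whole chunk */

	printf("%u vs %u acks per chunk\n", acks_no_lro, acks_lro);
	return (0);
}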

Let me clarify a bit.
First of all, I tested two different LRO implementations: our
"Software LRO" (tcp_lro.c), and the "Hardware LRO" implemented
by the 82599 (called RSC, or receive-side coalescing, in the 82599
data sheets). Jack Vogel and Navdeep Parhar (both in Cc) can
probably comment on the logic of both.

In my tests, either SW or HW LRO on the receive side HELPED A LOT,
not just in terms of raw throughput but also in terms of system
load on the receiver. On the receive side, LRO packs multiple data
segments into one that is passed up the stack.

As you mentioned, this also reduces the number of acks generated,
but not dramatically (remember that LRO is bounded by the number
of segments received within the mitigation interval).
In my tests the number of reads() on the receiver was reduced by
roughly a factor of 3 compared to the !LRO case, meaning 4-5 segments
merged by LRO on average. Navdeep reported similar numbers for cxgbe.

Using Hardware LRO on the transmit side had no ill effect.
Since it is done in hardware, I have no idea how it is implemented.

Using Software LRO on the transmit side did give a significant
throughput reduction. I can't explain the exact cause, though it
is possible that, between reducing the number of segments to the
receiver and collapsing the ACKs it generates, the sender starves.
But it could well be the extra delay in passing up the ACKs
that limits performance.
Either way, since the HW LRO did a fine job, I tried to figure
out whether avoiding LRO on pure acks could help, and the two-line
patch above did help.

Note, my patch was just a proof of concept, and may cause
reordering if a data segment is followed by a pure ack.
But this can be fixed easily by handling a pure ack as
an out-of-sequence packet in tcp_lro_rx() (a sketch of that idea
follows below).
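A minimal, untested sketch of that idea against the tcp_lro.c of this era:
on a pure ack, flush any in-progress entry for the same flow, then reject
the packet so it is passed up in order. The variable and field names
(cntl, lro, lro_active, source_ip/dest_ip/source_port/dest_port) are
assumptions based on that file, not a real patch.

	tcp_data_len = ip_len - (tcp->th_off << 2) - sizeof(*ip);
	if (tcp_data_len == 0) {
		/*
		 * Pure ack: flush any in-progress entry for the same
		 * flow so previously coalesced data cannot be reordered
		 * behind it, then reject the packet so the ack goes up
		 * the stack unmodified and in order.
		 */
		SLIST_FOREACH(lro, &cntl->lro_active, next) {
			if (lro->source_port == tcp->th_sport &&
			    lro->dest_port == tcp->th_dport &&
			    lro->source_ip == ip->ip_src.s_addr &&
			    lro->dest_ip == ip->ip_dst.s_addr) {
				SLIST_REMOVE(&cntl->lro_active, lro,
				    lro_entry, next);
				tcp_lro_flush(cntl, lro);
				break;
			}
		}
		return (-1);
	}

This keeps pure acks strictly in order with the data that preceded them,
at the cost of an occasional early flush.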

WIP patch is at:
http://people.freebsd.org/~lstewart/patches/misctcp/tcplro_multiack_9.x.r219723.patch

Jeff tested the WIP patch and it *doesn't* fix the issue. I don't have
LRO-capable hardware set up locally to figure out what I've missed. Most
of the machines in my lab run em(4) NICs, which don't support
LRO, but I'll see if I can find something that does and perhaps
resurrect this patch.

LRO can always be done in software.  You can do it at the driver,
ether_input, or ip_input level.

A few comments:
1. I don't think it makes sense to send multiple acks on
    coalesced segments (and the 82599 does not seem to do that).
    First of all, the acks would go out with minimal spacing (ideally
    less than 100ns), so chances are that the remote end will see
    them in a single burst anyway. Secondly, the remote end can
    easily tell that a single ACK is reporting multiple MSS and
    behave as if an equivalent number of acks had arrived.

ABC (Appropriate Byte Counting, RFC 3465) gets in the way, though; see
the sketch below.
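To make the interaction concrete, here is a minimal sketch, not taken from
the FreeBSD stack, of how a sender that counts bytes rather than acks
responds to a stretch ack, and where the RFC 3465 limit L (typically 2)
caps the slow-start credit. The function and variable names are
illustrative only.

#include <stdint.h>

#define ABC_L	2	/* RFC 3465 aggressiveness limit, in segments */

/*
 * Grow cwnd for one incoming ack that newly acknowledges
 * 'bytes_acked' bytes of outstanding data.
 */
static uint32_t
cwnd_after_ack(uint32_t cwnd, uint32_t ssthresh, uint32_t bytes_acked,
    uint32_t mss)
{
	if (cwnd < ssthresh) {
		/*
		 * Slow start: credit the bytes covered by the ack, so a
		 * stretch ack for 4-5 MSS counts like 4-5 acks -- except
		 * that ABC caps the credit at L*MSS per ack.
		 */
		uint32_t incr = bytes_acked;

		if (incr > ABC_L * mss)
			incr = ABC_L * mss;
		cwnd += incr;
	} else {
		/* Congestion avoidance: about one MSS per RTT overall. */
		cwnd += (uint32_t)(((uint64_t)mss * mss) / cwnd);
	}
	return (cwnd);
}

With LRO coalescing roughly 4-5 segments per ack, slow start thus opens
cwnd by at most 2*MSS per ack instead of 4-5*MSS, which is presumably the
caveat here.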

2. I am a big fan of LRO (and similar solutions), because it can save
    a lot of repeated work when passing packets up the stack, and the
    mechanism becomes more and more effective as the system load increases,
    which is a wonderful property in terms of system stability.

    For this reason, I think it would be useful to add support for software
    LRO in the generic code (sys/net/if.c) so that drivers can use the
    software implementation directly, even without hardware support
    (the per-driver boilerplate this would absorb is sketched below).

It hurts on higher-RTT links in the general case.  For LAN RTTs
it's good.
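For reference, a rough sketch of the per-driver boilerplate that such
generic support would absorb, using the tcp_lro.c API of this era
(tcp_lro_init/tcp_lro_rx/tcp_lro_flush). The surrounding rx-path function
names (rxq_input, rxq_flush_lro) are made up for illustration.

#include <sys/param.h>
#include <sys/mbuf.h>
#include <sys/queue.h>
#include <net/if.h>
#include <net/if_var.h>
#include <netinet/tcp_lro.h>

/*
 * Hypothetical per-queue rx handler: offer each received frame to the
 * software LRO engine first, and only call if_input() directly if it
 * could not be coalesced (or LRO is disabled).
 */
static void
rxq_input(struct ifnet *ifp, struct lro_ctrl *lc, struct mbuf *m)
{
	if ((ifp->if_capenable & IFCAP_LRO) == 0 ||
	    tcp_lro_rx(lc, m, 0) != 0)
		(*ifp->if_input)(ifp, m);
}

/* At the end of an rx batch, push up whatever was coalesced. */
static void
rxq_flush_lro(struct lro_ctrl *lc)
{
	struct lro_entry *le;

	while ((le = SLIST_FIRST(&lc->lro_active)) != NULL) {
		SLIST_REMOVE_HEAD(&lc->lro_active, next);
		tcp_lro_flush(lc, le);
	}
}

Generic support in if.c would essentially mean hoisting this pattern out
of each driver.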

3. Similar to LRO, it would make sense to implement a "software TSO"
    mechanism where the TCP sender pushes a large segment down to
    ether_output and code in if_ethersubr.c does the segmentation
    and checksum computation. This would save multiple traversals of
    the various layers of the stack recomputing essentially the same
    information for all segments (the per-segment header fixups are
    sketched below).

All modern NICs support hardware TSO.  There's little benefit in
having a parallel software implementation, and then you run into
the mbuf chain copying issue further down the stack.  The win won't
be much.
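For what it's worth, a minimal sketch of the per-segment header fixups a
software TSO in if_ethersubr.c would have to perform, leaving out the mbuf
splitting and checksum code entirely; nothing here is from the tree and
the function name is made up.

#include <sys/types.h>
#include <netinet/in.h>
#include <netinet/in_systm.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>

/*
 * 'ip' and 'th' point at a fresh per-segment copy of the original
 * headers; seg_idx is the 0-based segment number and seg_len the
 * payload bytes carried by this segment (at most the MSS).
 */
static void
sw_tso_fixup(struct ip *ip, struct tcphdr *th, uint32_t seg_idx,
    uint32_t mss, uint16_t seg_len, int is_last)
{
	/* Each segment advertises only its own length ... */
	ip->ip_len = htons((uint16_t)(sizeof(*ip) +
	    (th->th_off << 2) + seg_len));
	/* ... gets its own IP id ... */
	ip->ip_id = htons(ntohs(ip->ip_id) + seg_idx);
	/* ... and starts mss bytes further into the original payload. */
	th->th_seq = htonl(ntohl(th->th_seq) + seg_idx * mss);
	if (!is_last) {
		/* FIN/PSH only belong on the last segment. */
		th->th_flags &= ~(TH_FIN | TH_PUSH);
	}
	/*
	 * IP and TCP checksums must then be recomputed for each segment
	 * (or left to the NIC if it offers checksum offload).
	 */
}

The mbuf chain handling Andre mentions (splitting the payload and
prepending a copy of the headers for every segment) is where the copying
cost comes from.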

--
Andre
_______________________________________________
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
