Re: [PATCH] LRO ack aggregation

2007-11-20 Thread David Miller
From: Andrew Gallatin [EMAIL PROTECTED]
Date: Tue, 20 Nov 2007 06:47:57 -0500

 David Miller wrote:
   From: Herbert Xu [EMAIL PROTECTED]
   Date: Tue, 20 Nov 2007 14:09:18 +0800
  
   David Miller [EMAIL PROTECTED] wrote:
   Fundamentally, I really don't like this change, it batches to the
   point where it begins to erode the natural ACK clocking of TCP, and I
   therefore am very likely to revert it before merging to Linus.
   Perhaps make it a tunable that defaults to off?
  
   That's one idea.
 
 I'd certainly prefer the option to have a tunable to having our
 customers see performance regressions when they switch to
 the kernel's LRO.

Please qualify this because by itself it's an inaccurate statement.

It would cause a performance regression in situations where there is
nearly no packet loss, no packet reordering, and the receiver has
strong enough cpu power.

Because, in fact, some customers might see performance regressions
by using this ack aggregation code, in particular if they are
talking to the real internet at all.

 If vendors are going to pick this up, there is the risk of them just
 applying this patch (which currently has no tunable to disable it),
 leaving their users stuck with it enabled.

I will watch out for this and make sure to advise them strongly not to
do this.

So you can be sure this won't happen. :-)

 FWIW, we've seen TCP perform well in a WAN setting using our NICs and
 our LRO which does this ack aggregation.  For example, the last 2
 Supercomputing Bandwidth Challenges (making the most of a 10Gb/s
 WAN connection) were won by teams using our NICs, with drivers that
 did this sort of ack aggregation.

And that was with basically no packet loss, which just supports my
objections to this even more.  And this doesn't even begin to consider
the RX cpu-limited cases, where again ACK stretching hurts a lot.

The bandwidth challenge cases are not very realistic at all, and are
about as far from the realities of real internet traffic as you can
get.

Show me something over real backbones, talking to hundreds or thousands
of clients scattered all over the world.  That's what people will be
using these high-end NICs for in front-facing services, and that's where
loss happens and stretch ACKs hurt performance.

ACK stretching is bad bad bad for everything outside of some well
controlled test network bubble.


Re: [PATCH] LRO ack aggregation

2007-11-20 Thread Andrew Gallatin

David Miller wrote:
 From: Herbert Xu [EMAIL PROTECTED]
 Date: Tue, 20 Nov 2007 14:09:18 +0800

 David Miller [EMAIL PROTECTED] wrote:
 Fundamentally, I really don't like this change, it batches to the
 point where it begins to erode the natural ACK clocking of TCP, and I
 therefore am very likely to revert it before merging to Linus.
 Perhaps make it a tunable that defaults to off?

 That's one idea.

I'd certainly prefer the option to have a tunable to having our
customers see performance regressions when they switch to
the kernel's LRO.

 But if it's there, the risk is that it ends up being turned on
 always by distribution vendors, making our off-default pointless
 and the internet gets crapped up with misbehaving Linux TCP
 stacks anyways.

If vendors are going to pick this up, there is the risk of them just
applying this patch (which currently has no tunable to disable it),
leaving their users stuck with it enabled.  At least with a tunable,
it would be easy for them to turn it off.  And the comments surrounding
it could make it clear why it should default to off.

FWIW, we've seen TCP perform well in a WAN setting using our NICs and
our LRO which does this ack aggregation.  For example, the last 2
Supercomputing Bandwidth Challenges (making the most of a 10Gb/s
WAN connection) were won by teams using our NICs, with drivers that
did this sort of ack aggregation.

Drew


Re: [PATCH] LRO ack aggregation

2007-11-20 Thread Evgeniy Polyakov
Hi.

On Tue, Nov 20, 2007 at 08:27:05AM -0500, Andrew Gallatin ([EMAIL PROTECTED]) 
wrote:
 Hmm.. rather than a global tunable, what if it was a
 network driver managed tunable which toggled a flag in the
 lro_mgr features?  Would that be better?

What about ethtool control to set LRO_simple and LRO_ACK_aggregation?

-- 
Evgeniy Polyakov


Re: [PATCH] LRO ack aggregation

2007-11-20 Thread Andrew Gallatin

David Miller wrote:
 From: Andrew Gallatin [EMAIL PROTECTED]
 Date: Tue, 20 Nov 2007 06:47:57 -0500

 David Miller wrote:
   From: Herbert Xu [EMAIL PROTECTED]
   Date: Tue, 20 Nov 2007 14:09:18 +0800
  
   David Miller [EMAIL PROTECTED] wrote:
   Fundamentally, I really don't like this change, it batches to the
   point where it begins to erode the natural ACK clocking of TCP, 
and I

   therefore am very likely to revert it before merging to Linus.
   Perhaps make it a tunable that defaults to off?
  
   That's one idea.

 I'd certainly prefer the option to have a tunable to having our
 customers see performance regressions when they switch to
 the kernel's LRO.

 Please qualify this because by itself it's an inaccurate statement.

 It would cause a performance regression in situations where there is
 nearly no packet loss, no packet reordering, and the receiver has
 strong enough cpu power.

Yes, a regression of nearly 1Gb/s in some cases as I mentioned
when I submitted the patch.



 Show me something over real backbones, talking to hundreds or thousands
 of clients scattered all over the world.  That's what people will be
 using these high-end NICs for in front-facing services, and that's where
 loss happens and stretch ACKs hurt performance.


I can't.  I think most 10GbE on endstations is used either in the
server room, or on dedicated links.  My experience with 10GbE users is
limited to my interactions with people using our NICs who contact our
support.  Of those, I can recall only a tiny handful who were using
10GbE on a normal internet facing connection (and the ones I dealt
with were actually running a different OS).  The vast majority were in
a well controlled, lossless environment.  It is quite ironic.  The
very fact that I cannot provide you with examples of internet facing
people using LRO (w/ack aggr) in more normal applications tends to
support my point that most 10GbE users seem to be in lossless
environments.

 ACK stretching is bad bad bad for everything outside of some well
 controlled test network bubble.

I just want those in the bubble to continue to have the best performance
possible in their situation.  If it is a tunable that defaults to off,
that is great.

Hmm.. rather than a global tunable, what if it was a
network driver managed tunable which toggled a flag in the
lro_mgr features?  Would that be better?
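
A minimal sketch of what that could look like, assuming a new, purely
hypothetical LRO_F_ACK_AGGR feature bit alongside the existing
LRO_F_NAPI / LRO_F_EXTRACT_VLAN_ID flags (the flag value and the module
parameter name are illustrative, not from any posted patch):

#include <linux/module.h>
#include <linux/inet_lro.h>

#define LRO_F_ACK_AGGR 0x04             /* hypothetical: aggregate pure ACKs */

static int ack_aggr;                    /* per-driver tunable, off by default */
module_param(ack_aggr, int, 0644);
MODULE_PARM_DESC(ack_aggr, "aggregate pure TCP ACKs in LRO (default off)");

static void mydrv_setup_lro(struct net_lro_mgr *mgr)
{
        mgr->features = LRO_F_NAPI;     /* existing inet_lro feature flag */
        if (ack_aggr)
                mgr->features |= LRO_F_ACK_AGGR;
        /* ... rest of lro_mgr setup: max_desc, max_aggr, get_frag_header ... */
}

inet_lro's aggregation paths could then skip pure ACKs unless the bit
is set, keeping the off-default while letting a driver or its user opt in.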

Drew


Re: [PATCH] LRO ack aggregation

2007-11-20 Thread Evgeniy Polyakov
On Tue, Nov 20, 2007 at 09:50:56PM +0800, Herbert Xu ([EMAIL PROTECTED]) wrote:
 On Tue, Nov 20, 2007 at 04:35:09PM +0300, Evgeniy Polyakov wrote:
  
  On Tue, Nov 20, 2007 at 08:27:05AM -0500, Andrew Gallatin ([EMAIL 
  PROTECTED]) wrote:
   Hmm.. rather than a global tunable, what if it was a
   network driver managed tunable which toggled a flag in the
   lro_mgr features?  Would that be better?
  
  What about ethtool control to set LRO_simple and LRO_ACK_aggregation?
 
 I have two concerns about this:
 
 1) That same option can still be turned on by distros.

FC and Debian turn on hardware checksum offloading in e1000, and I have
a card where this results in more than a 10% performance _decrease_.
I do not know why, but I'm able to run a script which disables it via
ethtool.

 2) This doesn't make sense because the code is actually in the
    core networking stack.

It depends.  Software LRO can be controlled by a simple procfs switch,
but what about a hardware one?  I recall it was pointed out a number of
times that hardware LRO is possible and is likely being implemented in
some ASICs.

 I'm particularly unhappy about 2) because I don't want to be in a
 situation down the track where every driver is going to add this
 option so that they're not left behind in the arms race.

For software LRO I agree, but this looks exactly like the GSO/TSO case
and an additional tweak for software GSO.  Having it per-system is fine,
and I believe no one should ever care that some distro will do bad/good
things with it.  Actually, we already have so many tricky options in
procfs which can kill performance...

-- 
Evgeniy Polyakov


Re: [PATCH] LRO ack aggregation

2007-11-20 Thread Herbert Xu
On Tue, Nov 20, 2007 at 04:35:09PM +0300, Evgeniy Polyakov wrote:
 
 On Tue, Nov 20, 2007 at 08:27:05AM -0500, Andrew Gallatin ([EMAIL PROTECTED]) 
 wrote:
  Hmm.. rather than a global tunable, what if it was a
  network driver managed tunable which toggled a flag in the
  lro_mgr features?  Would that be better?
 
 What about ethtool control to set LRO_simple and LRO_ACK_aggregation?

I have two concerns about this:

1) That same option can still be turned on by distros.
2) This doesn't make sense because the code is actually in the
   core networking stack.

I'm particularly unhappy about 2) because I don't want to be in a
situation down the track where every driver is going to add this
option so that they're not left behind in the arms race.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH] LRO ack aggregation

2007-11-20 Thread Herbert Xu
On Tue, Nov 20, 2007 at 05:03:12PM +0300, Evgeniy Polyakov wrote:
 
 For software LRO I agree, but this looks exactly like the GSO/TSO case
 and an additional tweak for software GSO.  Having it per-system is fine,
 and I believe no one should ever care that some distro will do bad/good
 things with it.  Actually, we already have so many tricky options in
 procfs which can kill performance...

Right, if you're doing it such that the same option automatically
shows up for every driver that uses software LRO then my second
concern goes away.

Of course we still have the problem with the option in general
that Dave raised.  That is, this may cause the proliferation of
TCP receiver behaviour that may be undesirable.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH] LRO ack aggregation

2007-11-20 Thread Evgeniy Polyakov
On Tue, Nov 20, 2007 at 10:08:31PM +0800, Herbert Xu ([EMAIL PROTECTED]) wrote:
 Of course we still have the problem with the option in general
 that Dave raised.  That is, this may cause the proliferation of
 TCP receiver behaviour that may be undesirable.

Yes, it results in bursts of traffic because of the delayed ACKs
accumulated in the sender's LRO engine.  But then, if the receiver is
slow, it will send ACKs slowly and they will be accumulated slowly,
changing not only seq/ack numbers but also timings, which is equivalent
to increasing the length of the pipe between the users.  TCP is able to
balance on this edge.  I'm sure it depends on the workload, but heavy
bulk transfers, the only case where LRO with or without ACK aggregation
can win, are quite usual on long pipes with high performance numbers.

Until it is tested, I doubt it is possible to say it is 100% good or
bad, so my proposal is to write the code, make it tunable from
userspace, turn it off by default, and allow people to test the change.

-- 
Evgeniy Polyakov


Re: [PATCH] LRO ack aggregation

2007-11-20 Thread Rick Jones

David Miller wrote:

From: Andrew Gallatin [EMAIL PROTECTED]
Date: Tue, 23 Oct 2007 11:11:55 -0400



I've attached a patch which adds support to inet_lro for aggregating
pure acks.



I've applied this patch to net-2.6.25... but!

This needs some serious thinking.  What this patch ends up doing is creating
big stretch-ACKs, and those can hurt performance.

Stretch ACKs are particularly harmful when either the receiver is cpu
weak (lacking enough cpu power to fill the pipe completely no matter
what optimizations are applied) or when there is packet loss (less
feedback information and ACK clocking).

It also means that the sender will be more bursty, because it will now
swallow ACKs covering huge portions of the send window, and then have
large chunks of its send queue it can send out all at once.

Fundamentally, I really don't like this change, it batches to the
point where it begins to erode the natural ACK clocking of TCP, and I
therefore am very likely to revert it before merging to Linus.


Sounds like one might as well go ahead and implement HP-UX/Solaris-like 
ACK sending avoidance at the receiver and not bother with LRO-ACK on the 
sender.


In some experiments a while back I thought I saw that LRO on the
receiver was causing him to send fewer ACKs already?  IIRC that was with
a Myricom card; perhaps I was fooled by its own ACK LRO.


rick jones


Re: [PATCH] LRO ack aggregation

2007-11-20 Thread David Miller
From: Rick Jones [EMAIL PROTECTED]
Date: Tue, 20 Nov 2007 11:45:54 -0800

 Sounds like one might as well go ahead and implement HP-UX/Solaris-like 
 ACK sending avoidance at the receiver and not bother with LRO-ACK on the 
 sender.
 
 In some experiments a while back I thought I saw that LRO on the
 receiver was causing him to send fewer ACKs already?  IIRC that was with
 a Myricom card; perhaps I was fooled by its own ACK LRO.

Linux used to do aggressive ACK deferral, especially when the ucopy
code paths triggered.

I removed that code because I had several scenerios where it hurt more
than it helped performance, and the IETF has explicitly stated in
several documents the (proven) perils of such stretch ACKs.


Re: [PATCH] LRO ack aggregation

2007-11-20 Thread Bill Fink
On Tue, 20 Nov 2007, Andrew Gallatin wrote:

 David Miller wrote:
   From: Andrew Gallatin [EMAIL PROTECTED]
   Date: Tue, 20 Nov 2007 06:47:57 -0500
  
   David Miller wrote:
 From: Herbert Xu [EMAIL PROTECTED]
 Date: Tue, 20 Nov 2007 14:09:18 +0800

 David Miller [EMAIL PROTECTED] wrote:
 Fundamentally, I really don't like this change, it batches to the
 point where it begins to erode the natural ACK clocking of TCP, 
 and I
 therefore am very likely to revert it before merging to Linus.

I have mixed feelings about this topic.  In general I agree with the
importance of maintaining the natural ACK clocking of TCP for normal
usage.  But there may also be some special cases that could benefit
significantly from such a new LRO pure ACK aggregation feature.  The
rest of my comments are in support of such a new feature, although
I haven't completely made up my own mind yet about the tradeoffs
involved in implementing such a new capability (good arguments are
being made on both sides).

 Perhaps make it a tunable that defaults to off?

 That's one idea.
  
   I'd certainly prefer the option to have a tunable to having our
   customers see performance regressions when they switch to
   the kernel's LRO.
  
   Please qualify this because by itself it's an inaccurate statement.
  
   It would cause a performance regression in situations where there is
   nearly no packet loss, no packet reordering, and the receiver has
   strong enough cpu power.

You are basically describing the HPC universe, which, while not the
multitudes of the general Internet, is a very real and valid special
community of interest where maximum performance is critical.

For example, we're starting to see dynamic provisioning of dedicated
10-GigE lambda paths to meet various HPC requirements, just for the
purpose of ensuring nearly no packet loss and no packet reordering.
See for example Internet2's Dynamic Circuit Network (DCN).

In the general Internet case, many smaller flows tend to be aggregated
together up to perhaps a 10-GigE interface, while in the HPC universe,
there tend to be fewer, but much higher individual bandwidth flows.
But both are totally valid usage scenarios.  So a tunable that defaults
to off for the general case makes sense to me.

 Yes, a regression of nearly 1Gb/s in some cases as I mentioned
 when I submitted the patch.

Which is a significant performance penalty.  But the CPU savings may
be an even more important benefit.

 
 
   Show me something over real backbones, talking to hundreds or thousands
   of clients scattered all over the world.  That's what people will be
   using these high-end NICs for in front-facing services, and that's where
   loss happens and stretch ACKs hurt performance.

The HPC universe uses real backbones, just not the general Internet
backbones.  Their backbones are engineered to have the characteristics
required for enabling very high performance applications.

And if performance would take a hit in the general Internet 10-GigE
server case, and that's clearly documented and understood, I don't
see what incentive the distros would have to enable the tunable for
their normal users: why would they want to cause poorer performance
relative to other distros that stuck with the recommended default?
The special HPC users could easily enable the option if it was
desired and proven beneficial in their environment.

 I can't.  I think most 10GbE on endstations is used either in the
 server room, or on dedicated links.  My experience with 10GbE users is
 limited to my interactions with people using our NICs who contact our
 support.  Of those, I can recall only a tiny handful who were using
 10GbE on a normal internet facing connection (and the ones I dealt
 with were actually running a different OS).  The vast majority were in
 a well controlled, lossless environment.  It is quite ironic.  The
 very fact that I cannot provide you with examples of internet facing
 people using LRO (w/ack aggr) in more normal applications tends to
 support my point that most 10GbE users seem to be in lossless
 environments.

Most use of 10-GigE that I'm familiar with is related to the HPC
universe, but then that's the environment I work in.  I'm sure that
over time the use of 10-GigE in general Internet facing servers
will predominate, since that's where the great mass of users is.
But I would argue that that doesn't make it the sole usage arena
that matters.

   ACK stretching is bad bad bad for everything outside of some well
   controlled test network bubble.

It's not just for network bubbles.  That's where the technology tends
to first be shaken out, but the real goal is use in real-world,
production HPC environments.

 I just want those in the bubble to continue to have the best performance
 possible in their situation.  If it is a tunable that defaults to off,
 that is great.

I totally agree, and think that the tunable (defaulting to off)
allows both the general 

Re: [PATCH] LRO ack aggregation

2007-11-19 Thread David Miller
From: Andrew Gallatin [EMAIL PROTECTED]
Date: Tue, 23 Oct 2007 11:11:55 -0400

 I've attached a patch which adds support to inet_lro for aggregating
 pure acks.

I've applied this patch to net-2.6.25... but!

This needs some serious thinking.  What this patch ends up doing is creating
big stretch-ACKs, and those can hurt performance.

Stretch ACKs are particularly harmful when either the receiver is cpu
weak (lacking enough cpu power to fill the pipe completely no matter
what optimizations are applied) or when there is packet loss (less
feedback information and ACK clocking).

It also means that the sender will be more bursty, because it will now
swallow ACKs covering huge portions of the send window, and then have
large chunks of its send queue it can send out all at once.
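
As a rough illustration of that burstiness (the MSS and aggregation
ratio here are assumptions, not numbers from this thread):

#include <stdio.h>

/* With a typical 1448-byte MSS, one stretch ACK covering 8 segments
 * opens the window in a single jump, so the sender can emit all 8
 * segments back to back instead of clocking them out per ACK. */
int main(void)
{
        int mss = 1448;         /* typical ethernet TCP payload per segment */
        int aggr = 8;           /* assumed ACK aggregation ratio */

        printf("one stretch ACK releases %d bytes (%d segments) at once\n",
               mss * aggr, aggr);
        return 0;
}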

Fundamentally, I really don't like this change, it batches to the
point where it begins to erode the natural ACK clocking of TCP, and I
therefore am very likely to revert it before merging to Linus.


Re: [PATCH] LRO ack aggregation

2007-11-19 Thread David Miller
From: Herbert Xu [EMAIL PROTECTED]
Date: Tue, 20 Nov 2007 14:09:18 +0800

 David Miller [EMAIL PROTECTED] wrote:
 
  Fundamentally, I really don't like this change, it batches to the
  point where it begins to erode the natural ACK clocking of TCP, and I
  therefore am very likely to revert it before merging to Linus.
 
 Perhaps make it a tunable that defaults to off?

That's one idea.

But if it's there, the risk is that it ends up being turned on
always by distribution vendors, making our off-default pointless
and the internet gets crapped up with misbehaving Linux TCP
stacks anyways.


Re: [PATCH] LRO ack aggregation

2007-11-19 Thread Herbert Xu
David Miller [EMAIL PROTECTED] wrote:

 Fundamentally, I really don't like this change, it batches to the
 point where it begins to erode the natural ACK clocking of TCP, and I
 therefore am very likely to revert it before merging to Linus.

Perhaps make it a tunable that defaults to off?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[PATCH] LRO ack aggregation

2007-10-23 Thread Andrew Gallatin

Hi,

We recently did some performance comparisons between the new inet_lro
LRO support in the kernel, and our Myri10GE in-driver LRO.
For receive, we found they were nearly identical.  However, for
transmit, we found that Myri10GE's LRO shows much lower CPU
utilization.  We traced the CPU utilization difference to our driver
LRO aggregating TCP acks, and the inet_lro module not doing so.

I've attached a patch which adds support to inet_lro for aggregating
pure acks.  Aggregating pure acks (segments with TCP_PAYLOAD_LENGTH ==
0) entails freeing the skb (or putting the page in the frags case).
The patch also handles trimming (typical for 54-byte pure ack frames
which have been padded to the ethernet minimum 60 byte frame size).
In the frags case, I tried to keep things simple by only doing the
trim when the entire frame fits in the first frag.  To be safe, I
ensure that the padding is all 0 (or, more exactly, was some pattern
whose checksum is -0) so that it doesn't impact hardware checksums.
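
A rough sketch of that trim, written around the pkt_len/trim variables
visible in the (truncated) __lro_proc_skb hunk below; the exact body
here is my reconstruction of the idea, not the patch itself:

        /* Sketch only: drop ethernet pad from a pure ACK before aggregation.
         * A 54-byte ACK arrives padded to the 60-byte ethernet minimum, so
         * skb->len can exceed what the IP header claims the packet is
         * (assuming skb->len here no longer includes the link header). */
        pkt_len = ntohs(iph->tot_len) + vlan_hdr_len;
        trim = skb->len - pkt_len;
        if (trim > 0)
                skb_trim(skb, pkt_len); /* pad checksums to -0, so csums survive */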

This patch also fixes a small bug in the skb LRO path dealing with
vlans that I found when doing my own testing.  Specifically, in the
desc->active case, the existing code always fails the
lro_tcp_ip_check() for NICs without LRO_F_EXTRACT_VLAN_ID, because it
fails to subtract the vlan_hdr_len from the skb->len.

Jan-Bernd Themann ([EMAIL PROTECTED]) has tested the patch using
the eHEA driver (skb codepath), and I have tested it using Myri10GE
(both frags and skb codepath).

Using a pair of identical low-end 2.0GHz Athlon 64 X2 3800+ machines
with Myri10GE 10GbE NICs, I ran 10 iterations of netperf TCP_SENDFILE
tests, taking the median run for comparison purposes.  The receiver
was running Myri10GE + patched inet_lro:

TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to rome-my
(192.168.1.16) port 0 AF_INET : cpu bind
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

Myri10GE driver-specific LRO:
 87380  65536  65536    60.02    9442.65     16.24    69.31    0.282   1.203
Myri10GE + unpatched inet_lro:
 87380  65536  65536    60.02    9442.88     20.10    69.11    0.349   1.199
Myri10GE + patched inet_lro:
 87380  65536  65536    60.02    9443.30     16.95    68.97    0.294   1.197

The important bits here are the sender's CPU utilization and service
demand (cost per byte).  As you can see, without aggregating ACKs,
the overhead on the sender is roughly 20% higher, even when sending to
a receiver which uses LRO.  The differences are even more dramatic
when sending to a receiver which does not use LRO (and hence sends
more frequent acks).

Below is the same benchmark, run between a pair of 4-way 3.0GHz Xeon
5160 machines (Dell 2950) with Myri10GE NICs.  The receiver is running
Solaris 10U4, which does not do LRO, and is acking at approximately
8:1 (or ~100K acks/sec):

Myri10GE driver-specific LRO:
196712  65536  65536    60.01    9280.09      7.14    45.37    0.252   1.602
Myri10GE + unpatched inet_lro:
196712  65536  65536    60.01    8530.80     10.51    44.60    0.404   1.713
Myri10GE + patched inet_lro:
196712  65536  65536    60.00    9249.65      7.21    45.90    0.255   1.626
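
A quick sanity check on that 8:1 / ~100K acks/sec figure, assuming a
standard 1448-byte MSS (my arithmetic, not from the original post):

#include <stdio.h>

/* ~9.28 Gb/s of payload in 1448-byte segments is ~800K segments/sec;
 * acking every 8th segment then gives ~100K ACKs/sec. */
int main(void)
{
        double gbps = 9.28;                     /* throughput from the table above */
        double mss = 1448.0;                    /* assumed TCP payload per segment */
        double segs = gbps * 1e9 / 8.0 / mss;   /* bits -> bytes -> segments */

        printf("%.0f segments/sec, %.0f ACKs/sec at 8:1\n", segs, segs / 8.0);
        return 0;
}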

Signed-off-by: Andrew Gallatin [EMAIL PROTECTED]


Andrew Gallatin
Myricom Inc.

diff --git a/net/ipv4/inet_lro.c b/net/ipv4/inet_lro.c
index ac3b1d3..eba145b 100644
--- a/net/ipv4/inet_lro.c
+++ b/net/ipv4/inet_lro.c
@@ -58,9 +58,6 @@ static int lro_tcp_ip_check(struct iphdr
        if (ntohs(iph->tot_len) != len)
                return -1;

-       if (TCP_PAYLOAD_LENGTH(iph, tcph) == 0)
-               return -1;
-
        if (iph->ihl != IPH_LEN_WO_OPTIONS)
                return -1;

@@ -223,6 +220,11 @@ static void lro_add_packet(struct net_lr

        lro_add_common(lro_desc, iph, tcph, tcp_data_len);

+       if (tcp_data_len == 0) {
+               dev_kfree_skb_any(skb);
+               return;
+       }
+
        skb_pull(skb, (skb->len - tcp_data_len));
        parent->truesize += skb->truesize;

@@ -244,6 +246,11 @@ static void lro_add_frags(struct net_lro

        lro_add_common(lro_desc, iph, tcph, tcp_data_len);

+       if (tcp_data_len == 0) {
+               put_page(skb_frags[0].page);
+               return;
+       }
+
        skb->truesize += truesize;

        skb_frags[0].page_offset += hlen;
@@ -338,6 +345,8 @@ static int __lro_proc_skb(struct net_lro
        struct tcphdr *tcph;
        u64 flags;
        int vlan_hdr_len = 0;
+       int pkt_len;
+       int trim;

        if (!lro_mgr->get_skb_header
            || lro_mgr->get_skb_header(skb, (void *)&iph, (void *)&tcph,
@@ -355,6 +364,17 @@ static int __lro_proc_skb(struct net_lro
            !test_bit(LRO_F_EXTRACT_VLAN_ID, &lro_mgr->features))
                vlan_hdr_len = VLAN_HLEN;

+       /* strip padding