Re: [PATCH 0/4][RFC] lro: Generic Large Receive Offload for TCP traffic

2007-07-31 Thread Jan-Bernd Themann
Hi,

Thanks for finding these bugs! I'll post an updated version soon (two patches,
with no separate Kconfig patch: one LRO patch and one eHEA patch). See comments
below.

Thanks,
Jan-Bernd

On Monday 30 July 2007 22:32, Andrew Gallatin wrote:
 I was working on testing the myri10ge patch, and I ran into a few
 problems.  I've attached a patch to inet_lro.c to fix some of them,
 and a patch to myri10ge.c to show how to use the page based
 interface.  Both patches are signed off by Andrew Gallatin
 [EMAIL PROTECTED]
 
 First, the LRO_MAX_PG_HLEN is still a problem.  Minimally sized 60
 byte frames still cause problems in lro_gen_skb due to skb->len
 going negative.  Fixed in the attached patch.  It may be simpler
 to just drop LRO_MAX_PG_HLEN to ETH_ZLEN, but I'm not sure if
 that is enough.  Are there smart NICs which might chop off padding
 themselves?

I'd tend to stick with the explicit check as implemented in your patch
for now.

 
 Second, you still need to set skb->ip_summed = CHECKSUM_UNNECESSARY
 when modified packets are flushed, else the stack will see bad
 checksums for packets from CHECKSUM_COMPLETE drivers using the
 skb interface.  Fixed in the attached patch.

I thought about it... As we do update the TCP checksum for aggregated
packets, we could add a second ip_summed field to the net_lro_mgr struct,
used for aggregated packets, to support HW that does not have any checksum
helper functionality. These drivers could set this ip_summed field to
CHECKSUM_NONE and thus leave the checksum check to the stack. I'm not sure
these old devices benefit much from LRO, though. What do you think?
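
Roughly what I have in mind, as a sketch only (the field name ip_summed_aggr
is a placeholder I'm making up here; it is not in the posted patches):

	/* in struct net_lro_mgr: */
	int ip_summed;      /* as today: for packets passed up unmodified */
	int ip_summed_aggr; /* new: stamped on aggregated packets at flush */

	/* driver setup for HW without checksum helpers: */
	lro_mgr->ip_summed_aggr = CHECKSUM_NONE;  /* stack re-verifies */

	/* and at flush time the core would do: */
	lro_desc->parent->ip_summed = lro_mgr->ip_summed_aggr;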

 
 Fourth, I did some traffic sniffing to try to figure out what's going
 on above, and saw tcpdump complain about bad checksums.  Have you tried
 running tcpdump -s 65535 -vvv?  Have you also seen bad checksums?
 I seem to see this for both page- and skb-based versions of the driver.
 

Hmmm, I can't confirm that. For our skb-based version I see correct
checksums for aggregated packets, and for the page-based version as well.
I used (tcpdump -i ethX -s 0 -w dump.bin) in combination with ethereal,
and I don't see problems with your tcpdump command either.



Re: [PATCH 0/4][RFC] lro: Generic Large Receive Offload for TCP traffic

2007-07-31 Thread Andrew Gallatin

Jan-Bernd Themann wrote:

On Monday 30 July 2007 22:32, Andrew Gallatin wrote:



Second, you still need to set skb->ip_summed = CHECKSUM_UNNECESSARY
when modified packets are flushed, else the stack will see bad
checksums for packets from CHECKSUM_COMPLETE drivers using the
skb interface.  Fixed in the attached patch.


I thought about it... As we do update the TCP checksum for aggregated
packets, we could add a second ip_summed field to the net_lro_mgr struct,
used for aggregated packets, to support HW that does not have any checksum
helper functionality. These drivers could set this ip_summed field to
CHECKSUM_NONE and thus leave the checksum check to the stack. I'm not sure
these old devices benefit much from LRO, though. What do you think?


This might be handy, and it would also fix the problem with
CHECKSUM_PARTIAL drivers using the skb interface by allowing them
to set it to CHECKSUM_UNNECESSARY.
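
E.g. (a sketch with the hypothetical field name from the suggestion above):

	/* drivers whose hardware checksum state would confuse the stack on
	 * LRO-modified packets just mark aggregated skbs as already checked */
	lro_mgr->ip_summed_aggr = CHECKSUM_UNNECESSARY;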



Fourth, I did some traffic sniffing to try to figure out what's going
on above, and saw tcpdump complain about bad checksums.  Have you tried
running tcpdump -s 65535 -vvv?  Have you also seen bad checksums?
I seem to see this for both page- and skb-based versions of the driver.



Hmmm, I can't confirm that. For our skb-based version I see correct
checksums for aggregated packets, and for the page-based version as well.
I used (tcpdump -i ethX -s 0 -w dump.bin) in combination with ethereal,
and I don't see problems with your tcpdump command either.


I'm still trying to get a handle on this.  It happens with both
page-based and skb-based receive for me.  I would not be
surprised if I were doing something wrong in myri10ge, but
I don't see it without LRO, or with my own LRO.  I'll let you
know when I figure it out.

In the meantime, in case you have any insight, I've left a
capture of a small netcat transfer of a 64KB file full
of zeros at http://www.myri.com/staff/gallatin/lro/netcat_dump.bz2

Drew



Re: [PATCH 0/4][RFC] lro: Generic Large Receive Offload for TCP traffic

2007-07-30 Thread Jeff Garzik
Seems pretty good to me, save for one minor detail:  patches #1/#2 
should be combined together for greater git-bisect happiness.  Ditto for 
patches #3/#4.  Largely harmless in this case, but it keeps the git 
history pollution to a minimum.


Caveat reviewer:  I'm not an expert on the net/ipv4/* code, so I reviewed 
largely from the driver API perspective.


David, thoughts on merging?  I'm not picky; we could stick this into your 
tree or mine.  Whether yours or mine, I would like to keep the driver and 
net-core patches together in the same git tree.


Jeff





Re: [PATCH 0/4][RFC] lro: Generic Large Receive Offload for TCP traffic

2007-07-30 Thread Linas Vepstas
On Mon, Jul 30, 2007 at 05:24:33PM +0200, Jan-Bernd Themann wrote:
 
 Changes to http://www.spinics.net/lists/netdev/msg36912.html
 
 1) A new field called features has been added to the net_lro_mgr struct.
    It is set by the driver to indicate:
    - LRO_F_NAPI: Use NAPI / netif_rx to pass packets to the stack
    - LRO_F_EXTRACT_VLAN_ID: Set by the driver if HW extracts VLAN IDs for
      VLAN packets but does not modify the ETH protocol (ETH_P_8021Q)
 
 2) Padded frames are not aggregated for now (bug fixed).
 
 3) The correct header length is now used; no minimal header length is
    assumed for aggregated packets anymore.
 
 4) Statistics counters were introduced. They are stored in a new struct in
    the net_lro_mgr. This has the advantage that no locking is required in
    cases where the driver uses multiple lro_mgrs for different receive
    queues. Thus we get the following statistics per lro_mgr / eth device:
    - Number of aggregated packets
    - Number of flushed packets
    - Number of times we ran out of lro_descs
 
    The ratio of aggregated to flushed packets gives you an idea of how
    well LRO is working (see the sketch below).
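
    For example, the counters could be dumped like this (a sketch; the
    field names follow this description and may still change):

	struct net_lro_stats *st = &lro_mgr->stats;

	printk(KERN_INFO "%s: %lu aggregated, %lu flushed, %lu no_desc, "
	       "ratio %lu\n", dev->name, st->aggregated, st->flushed,
	       st->no_desc, st->flushed ? st->aggregated / st->flushed : 0);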

I'd like to see an edited form of this, together with an introduction to
LRO, written up in the Documentation subdirectory.  

As someone with some driver experience, but not on the bleeding edge,
some basic newbie questions pop into mind:

-- what is LRO?
-- Basic principles of operation?
-- Can I use it in my driver?  
-- Does my hardware have to have some special feature before I can use it?
-- What sort of performance improvement does it provide? Throughput?
   Latency? CPU usage? How does it affect DMA allocation? Does it 
   improve only a certain type of traffic (large/small packets, etc.)?
-- Example code? What's the API? How should my driver use it?

Right now, I can maybe find answers by doing lots of googling.  I'd like
to have some quick way of getting a grip on this.

--linas
   


Re: [PATCH 0/4][RFC] lro: Generic Large Receive Offload for TCP traffic

2007-07-30 Thread Andrew Gallatin


Here is a quick reply before something more official can
be written up:

Linas Vepstas wrote:

 -- what is LRO?

Large Receive Offload

 -- Basic principles of operation?

LRO is analogous to a receive side version of TSO.  The NIC (or
driver) merges several consecutive segments from the same connection,
fixing up checksums, etc.  Rather than up to 45 separate 1500 byte
frames (meaning up to 45 trips through the network stack), the driver
merges them into one 65212 byte frame.  It currently works only
with TCP over IPv4.
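
In very rough pseudo-C, the test applied before merging a new segment
into an existing aggregate looks something like this (a simplified
sketch; the real checks also cover IP/TCP options, timestamps, the ACK
field, etc.):

	static int can_merge(struct net_lro_desc *d, struct iphdr *iph,
			     struct tcphdr *tcph)
	{
		if (iph->saddr != d->iph->saddr || iph->daddr != d->iph->daddr)
			return 0;	/* different connection */
		if (tcph->source != d->tcph->source || tcph->dest != d->tcph->dest)
			return 0;
		if (ntohl(tcph->seq) != d->tcp_next_seq)
			return 0;	/* out of order: flush instead */
		if (tcph->syn || tcph->fin || tcph->rst || tcph->urg)
			return 0;	/* only plain data segments merge */
		return 1;
	}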

LRO was, AFAIK, first thought of by Neterion.  They had a paper about
it at OLS 2005:
http://www.linuxinsight.com/files/ols2005/grossman-reprint.pdf

 -- Can I use it in my driver?

Yes, it can be used in any driver.

 -- Does my hardware have to have some special feature before I can 
use it?


No.

 -- What sort of performance improvement does it provide? Throughput?
Latency? CPU usage? How does it affect DMA allocation? Does it
improve only a certain type of traffic (large/small packets, etc.)

The benefit is directly proportional to the packet rate.

See my reply to the previous RFC for performance information.  The
executive summary is that for the myri10ge 10GbE driver on low-end
hardware with 1500b frames, I've seen it increase throughput by a
factor of nearly 2.5x, while at the same time reducing CPU utilization
by 17%.  The effect for jumbo frames is less dramatic, but still
impressive (1.10x, 14% CPU reduction).

You can achieve better speedups if your driver receives into
high-order pages.

 -- Example code? What's the API? How should my driver use it?

Patch 3/4 in this set showed an example of converting a driver
to use LRO for skb-based receive buffers.  I'm working on
a patch for myri10ge that shows how you would use it in a driver
which receives into pages.
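
For the skb case it boils down to two calls in the driver's NAPI poll
loop, roughly like this (a sketch using the names from the posted
patches; lro_mgr is assumed to be set up once at open time):

	/* per received frame, instead of handing the skb straight up: */
	lro_receive_skb(&priv->lro_mgr, skb, priv);

	/* at the end of the poll round, push out partial aggregates: */
	lro_flush_all(&priv->lro_mgr);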

Cheers,

Drew


Re: [PATCH 0/4][RFC] lro: Generic Large Receive Offload for TCP traffic

2007-07-30 Thread Stephen Hemminger
Could someone add this to the http://linux-net.osdl.org wiki?


Re: [PATCH 0/4][RFC] lro: Generic Large Receive Offload for TCP traffic

2007-07-30 Thread Andrew Gallatin

Jan-Bernd Themann wrote:

Hi,

this patch set contains the latest generic LRO code, a Kconfig / Makefile,
and an eHEA patch demonstrating how the aggregate SKB interface has to
be used.
Drew, could you provide a patch for the myri10ge driver to show how the
receive-into-pages interface works?


Hi,

Thank you for the many fixes, and especially for the counters!

I was working on testing the myri10ge patch, and I ran into a few
problems.  I've attached a patch to inet_lro.c to fix some of them,
and a patch to myri10ge.c to show how to use the page based
interface.  Both patches are signed off by Andrew Gallatin
[EMAIL PROTECTED]

First, the LRO_MAX_PG_HLEN is still a problem.  Minimally sized 60
byte frames still cause problems in lro_gen_skb due to skb->len
going negative.  Fixed in the attached patch.  It may be simpler
to just drop LRO_MAX_PG_HLEN to ETH_ZLEN, but I'm not sure if
that is enough.  Are there smart NICs which might chop off padding
themselves?

Second, you still need to set skb->ip_summed = CHECKSUM_UNNECESSARY
when modified packets are flushed, else the stack will see bad
checksums for packets from CHECKSUM_COMPLETE drivers using the
skb interface.  Fixed in the attached patch.

Third, for some reason I have yet to figure out, this version of the
patch takes a while to ramp up, but only when the receiver MTU
is 9000 and the sender MTU is 1500.  If the receiver MTU is also
1500, even a 10-second netperf easily shows line rate.  I don't
see this with our LRO, and I'm so far at a loss to explain it.

Here is the first 3 seconds of  output from a simple program
which diffs the interface counters once / sec.
     Ipkts       IBytes    Opkts    Obytes

rx MTU=9000:
         0            0        0         0
        72        99092       14      1102
       105       158970        7       420
    750324   1135987084    17804   1068240
    814427   1233042478    18529       740

rx MTU=1500:
         0            0        0         0
    441344    668180168    13132    788182
    815938   1235328656    18716   1122960
    817698   1237994772    18628   1117680
...

Once it has ramped up, the bandwidth is fine, and there
doesn't seem to be much difference between setting the
receiver MTU to 1500 or 9000.  But it takes a very long
netperf run to overcome the ramp up period.

Fourth, I did some traffic sniffing to try to figure out what's going
on above, and saw tcpdump complain about bad checksums.  Have you tried
running tcpdump -s 65535 -vvv?  Have you also seen bad checksums?
I seem to see this for both page- and skb-based versions of the driver.

Last, the pointer to the TCP header in __lro_proc_segment() in the
unaccelerated vlan hdr case needs to also be offset by vlan_hdr_len
from skb->data.   I've fixed this in the attached patch.
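
That is, something like this (a sketch of the idea, not the literal hunk;
IP_HDR_LEN() is the helper macro inet_lro.c already uses):

	/* unaccelerated VLAN tag still in the buffer: both the IP and the
	 * TCP header sit vlan_hdr_len further into the frame */
	iph = (struct iphdr *)(skb->data + vlan_hdr_len);
	tcph = (struct tcphdr *)((u8 *)iph + IP_HDR_LEN(iph));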



Thanks,

Drew
--- linux-2.6.23-rc1.orig/net/ipv4/inet_lro.c	2007-07-30 13:10:49.000000000 -0400
+++ linux-2.6.23-rc1/net/ipv4/inet_lro.c	2007-07-30 15:17:51.000000000 -0400
@@ -126,6 +126,7 @@ static void lro_update_tcp_ip_header(str
 					htons(lro_desc->ip_tot_len) -
 					IP_HDR_LEN(iph), IPPROTO_TCP,
 					lro_desc->data_csum);
+	lro_desc->parent->ip_summed = CHECKSUM_UNNECESSARY;
 }
 
 static __wsum lro_tcp_data_csum(struct iphdr *iph, struct tcphdr *tcph, int len)
@@ -396,6 +397,9 @@ static struct sk_buff *lro_gen_skb(struc
 	if (!skb)
 		return NULL;
 
+	if (hlen > len)
+		hlen = len;
+
 	skb->len = len;
 	skb->data_len = len - hlen;
 	skb->truesize += true_size;
diff -urp a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c
--- a/drivers/net/myri10ge/myri10ge.c	2007-07-24 15:57:12.000000000 -0400
+++ b/drivers/net/myri10ge/myri10ge.c	2007-07-30 16:12:06.000000000 -0400
@@ -48,6 +48,7 @@
 #include <linux/etherdevice.h>
 #include <linux/if_ether.h>
 #include <linux/if_vlan.h>
+#include <linux/inet_lro.h>
 #include <linux/ip.h>
 #include <linux/inet.h>
 #include <linux/in.h>
@@ -62,6 +63,8 @@
 #include <linux/io.h>
 #include <linux/log2.h>
 #include <net/checksum.h>
+#include <net/ip.h>
+#include <net/tcp.h>
 #include <asm/byteorder.h>
 #include <asm/io.h>
 #include <asm/processor.h>
@@ -89,6 +92,7 @@ MODULE_LICENSE("Dual BSD/GPL");
 
 #define MYRI10GE_EEPROM_STRINGS_SIZE 256
 #define MYRI10GE_MAX_SEND_DESC_TSO ((65536 / 2048) * 2)
+#define MYRI10GE_MAX_LRO_DESCRIPTORS 8
 
 #define MYRI10GE_NO_CONFIRM_DATA htonl(0xffffffff)
 #define MYRI10GE_NO_RESPONSE_RESULT 0xffffffff
@@ -151,6 +155,8 @@ struct myri10ge_rx_done {
 	dma_addr_t bus;
 	int cnt;
 	int idx;
+	struct net_lro_mgr lro_mgr;
+	struct net_lro_desc lro_desc[MYRI10GE_MAX_LRO_DESCRIPTORS];
 };
 
 struct myri10ge_priv {
@@ -276,6 +282,14 @@ static int myri10ge_debug = -1;	/* defau
 module_param(myri10ge_debug, int, 0);

Re: [PATCH 0/4][RFC] lro: Generic Large Receive Offload for TCP traffic

2007-07-30 Thread David Miller
From: Jeff Garzik [EMAIL PROTECTED]
Date: Mon, 30 Jul 2007 12:17:58 -0400

 David, thoughts on merging?  I'm not picky; we could stick this into your 
 tree or mine.  Whether yours or mine, I would like to keep the driver and 
 net-core patches together in the same git tree.

No objections, and I'll stick it into my net-2.6.24 tree once I cut
that.  I'll have the NAPI stuff in there too, so I'll keep these two
merge nightmares out of your hair :-)



Re: [PATCH 0/4][RFC] lro: Generic Large Receive Offload for TCP traffic

2007-07-30 Thread Jeff Garzik

David Miller wrote:

From: Jeff Garzik [EMAIL PROTECTED]
Date: Mon, 30 Jul 2007 12:17:58 -0400

David, thoughts on merging?  I'm not picky; we could stick this into your 
tree or mine.  Whether yours or mine, I would like to keep the driver and 
net-core patches together in the same git tree.


No objections, and I'll stick it into my net-2.6.24 tree once I cut
that.  I'll have the NAPI stuff in there too, so I'll keep these two
merge nightmares out of your hair :-)


Works for me.

Thanks,

Jeff





RE: [PATCH 0/4][RFC] lro: Generic Large Receive Offload for TCP traffic

2007-07-30 Thread Leonid Grossman


 -Original Message-
 From: [EMAIL PROTECTED] [mailto:netdev-
 [EMAIL PROTECTED] On Behalf Of Andrew Gallatin
 Sent: Monday, July 30, 2007 10:43 AM
 To: Linas Vepstas
 Cc: Jan-Bernd Themann; netdev; Thomas Klein; Jeff Garzik; Jan-Bernd
 Themann; linux-kernel; linux-ppc; Christoph Raisch; Marcus Eder;
Stefan
 Roscher; David Miller
 Subject: Re: [PATCH 0/4][RFC] lro: Generic Large Receive Offload for
 TCP traffic
 
 
 Here is a quick reply before something more official can
 be written up:
 
 Linas Vepstas wrote:
 
   -- what is LRO?
 
 Large Receive Offload
 
   -- Basic principles of operation?
 
 LRO is analogous to a receive-side version of TSO.  The NIC (or
 driver) merges several consecutive segments from the same connection,
 fixing up checksums, etc.  Rather than up to 45 separate 1500-byte
 frames (meaning up to 45 trips through the network stack), the driver
 merges them into one 65212-byte frame.  It currently works only
 with TCP over IPv4.
 
 LRO was, AFAIK, first thought of by Neterion.  They had a paper about
 it at OLS 2005:
 http://www.linuxinsight.com/files/ols2005/grossman-reprint.pdf
 
   -- Can I use it in my driver?
 
 Yes, it can be used in any driver.
 
   -- Does my hardware have to have some special feature before I can
 use it?
 
 No.

Ditto. LRO hw assists (or full LRO offload, meaning that the merge
happens in the ASIC but the merge criteria are set by the host) are
quite beneficial, especially as the number of LRO connections gets very
large (Neterion has IP around fw/hw LRO and will release full hw LRO
offload in the next ASIC), but as Andrew indicated a sw-only LRO
implementation can be done for any NIC and provides good results,
especially for non-jumbo workloads.

 
   -- What sort of performance improvement does it provide? Throughput?
      Latency? CPU usage? How does it affect DMA allocation? Does it
      improve only a certain type of traffic (large/small packets, etc.)?
 
 The benefit is directly proportional to the packet rate.
 
 See my reply to the previous RFC for performance information.  The
 executive summary is that for the myri10ge 10GbE driver on low-end
 hardware with 1500b frames, I've seen it increase throughput by a
 factor of nearly 2.5x, while at the same time reducing CPU utilization
 by 17%.  The effect for jumbo frames is less dramatic, but still
 impressive (1.10x, 14% CPU reduction).
 
 You can achieve better speedups if your driver receives into
 high-order pages.
 
   -- Example code? What's the API? How should my driver use it?
 
 Patch 3/4 in this set showed an example of converting a driver
 to use LRO for skb-based receive buffers.  I'm working on
 a patch for myri10ge that shows how you would use it in a driver
 which receives into pages.
 
 Cheers,
 
 Drew