[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
On Mon, 27 Aug 2007, jamal wrote:

On Sun, 2007-26-08 at 19:04 -0700, David Miller wrote:

The transfer is much better behaved if we ACK every two full sized frames we copy into the receiver, and therefore don't stretch ACK, but at the cost of cpu utilization.

The rx coalescing in theory should help by accumulating more ACKs on the rx side of the sender, but it doesn't seem to do that, i.e. for the 9K MTU, you are better off turning off the coalescing if you want higher numbers. Also some of the TOE vendors (chelsio?) claim to have fixed this by reducing bursts on outgoing packets.

Bill: who suggested (as per your email) the 75 usec value, and what was it based on measurement-wise?

Belatedly getting back to this thread. There was a recent myri10ge patch that changed the default value for tx/rx interrupt coalescing to 75 usec, claiming it was an optimum value for maximum throughput (it is also mentioned in their external README documentation).

I also did some empirical testing to determine the effect of different values of TX/RX interrupt coalescing on 10-GigE network performance, both with TSO enabled and with TSO disabled. The actual test runs are attached at the end of this message, but the results are summarized in the following table (network performance in Mbps).

TX/RX interrupt coalescing in usec (both sides):

                  0     15    30    45    60    75    90    105
TSO enabled    8909  9682  9716  9725  9739  9745  9688  9648
TSO disabled   9113  9910  9910  9910  9910  9910  9910  9910

TSO disabled performance is always better than equivalent TSO enabled performance. With TSO enabled, the optimum performance is indeed at a TX/RX interrupt coalescing value of 75 usec. With TSO disabled, performance is the full 10-GigE line rate of 9910 Mbps for any value of TX/RX interrupt coalescing from 15 usec to 105 usec.

BTW, thanks for finding the energy to run those tests, and a very refreshing perspective.
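For anyone wanting to reproduce the sweep above, interrupt coalescing can be adjusted at runtime with ethtool's standard coalescing knobs. A minimal sketch, assuming the NIC is eth2 (a placeholder name) and that the driver supports the rx-usecs/tx-usecs parameters, as myri10ge does:

```shell
# Set TX and RX interrupt coalescing to 75 usec (repeat on both hosts).
ethtool -C eth2 rx-usecs 75 tx-usecs 75

# Confirm what the driver actually accepted:
ethtool -c eth2
```

Which coalescing parameters are honored varies by driver; unsupported ones are simply rejected.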
I don't mean to add more work, but I had some queries: on your earlier tests, I think that Reno showed some significant differences on the lower MTU case over BIC. I wonder if this is consistent?

Here's a retest (5 tests each):

TSO enabled:

TCP Cubic (initial_ssthresh set to 0):

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5007.6295 MB / 10.06 sec = 4176.1807 Mbps 36 %TX 100 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4950.9279 MB / 10.06 sec = 4130.2528 Mbps 36 %TX 99 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4917.1742 MB / 10.05 sec = 4102.5772 Mbps 35 %TX 99 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4948.7920 MB / 10.05 sec = 4128.7990 Mbps 36 %TX 100 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4937.5765 MB / 10.05 sec = 4120.6460 Mbps 35 %TX 99 %RX

TCP Bic (initial_ssthresh set to 0):

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5005.5335 MB / 10.06 sec = 4172.9571 Mbps 36 %TX 99 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5001.0625 MB / 10.06 sec = 4169.2960 Mbps 36 %TX 99 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4957.7500 MB / 10.06 sec = 4135.7355 Mbps 36 %TX 99 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4957.3777 MB / 10.06 sec = 4135.6252 Mbps 36 %TX 99 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5059.1815 MB / 10.05 sec = 4221.3546 Mbps 37 %TX 99 %RX

TCP Reno:

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4973.3532 MB / 10.06 sec = 4147.3589 Mbps 36 %TX 100 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4984.4375 MB / 10.06 sec = 4155.2131 Mbps 36 %TX 99 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4995.6841 MB / 10.06 sec = 4166.2734 Mbps 36 %TX 100 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4982.2500 MB / 10.05 sec = 4156.7586 Mbps 36 %TX 99 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4989.9796 MB / 10.05 sec = 4163.0949 Mbps 36 %TX 99 %RX

TSO disabled:

TCP Cubic (initial_ssthresh set to 0):

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5075.8125 MB / 10.02 sec = 4247.3408 Mbps 99 %TX 100 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5056. MB / 10.03 sec = 4229.9621 Mbps 100 %TX 100 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5047.4375 MB / 10.03 sec = 4223.1203 Mbps 99 %TX 100 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5066.1875 MB / 10.03 sec = 4239.1659 Mbps 100 %TX 100 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4986.3750 MB / 10.03 sec = 4171.9906 Mbps 99 %TX 100 %RX

TCP Bic (initial_ssthresh set to 0):

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5040.5625 MB / 10.03 sec = 4217.3521 Mbps 100 %TX 100 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5049.7500 MB / 10.03 sec = 4225.4585 Mbps 99 %TX 100 %RX
[ofa-general] RE: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: Krishna Kumar2 [EMAIL PROTECTED]
Date: Wed, 29 Aug 2007 10:43:23 +0530

The reason was to run parallel copies, not for buffer limitations.

Oh, I see. I'll note in passing that current lmbench-3 has some parallelization features you could play with; you might want to check it out.

I've also used iperf for parallel connections successfully, and that will allow you to mess with the buffer sizes as well, along with other variables of the data streams, for both TCP and UDP.

Cheers,
-PJ

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
Hi Dave,

I am scp'ing from 192.168.1.1 to 192.168.1.2 and captured at the send side:

192.168.1.1.37201 > 192.168.1.2.ssh: P 837178092:837178596(504) ack 1976304527 win 79 <nop,nop,timestamp 36791208 123919397>
192.168.1.1.37201 > 192.168.1.2.ssh: . 837178596:837181492(2896) ack 1976304527 win 79 <nop,nop,timestamp 36791208 123919397>
192.168.1.1.37201 > 192.168.1.2.ssh: . 837181492:837184388(2896) ack 1976304527 win 79 <nop,nop,timestamp 36791208 123919397>
192.168.1.1.37201 > 192.168.1.2.ssh: . 837184388:837188732(4344) ack 1976304527 win 79 <nop,nop,timestamp 36791208 123919397>
192.168.1.1.37201 > 192.168.1.2.ssh: . 837188732:837193076(4344) ack 1976304527 win 79 <nop,nop,timestamp 36791208 123919397>
192.168.1.1.37201 > 192.168.1.2.ssh: . 837193076:837194524(1448) ack 1976304527 win 79 <nop,nop,timestamp 36791208 123919397>
192.168.1.2.ssh > 192.168.1.1.37201: . ack 837165060 win 3338 <nop,nop,timestamp 123919397 36791208>

Data in pipeline: 837194524 - 837165060 = 29464. In most cases, I am getting 7K, 8K, 13K, and rarely close to 16K.

I ran iperf with 4K, 16K, and 32K buffer sizes (as I could do multiple threads instead of a single process). Results are (for E1000 82547GI chipset, BW in KB/s):

Test                  Org BW   New BW     %
Size:4096  Procs:1    114612   114644   .02
Size:16394 Procs:1    114634   114644     0
Size:32768 Procs:1    114645   114643     0

And for multiple threads:

Test                  Org BW   New BW     %
Size:4096  Procs:8    114632   114637     0
Size:4096  Procs:16   114639   114637     0
Size:4096  Procs:64   114893   114800  -.08
Size:16394 Procs:8    114641   114642     0
Size:16394 Procs:16   114642   114643     0
Size:16394 Procs:64   114911   114781  -.11
Size:32768 Procs:8    114638   114639     0
Size:32768 Procs:16   114642   114645     0
Size:32768 Procs:64   114932   114777  -.13

I will run netperf and report CPU utilization too.

Thanks,
- KK

David Miller [EMAIL PROTECTED] wrote on 08/21/2007 12:48:24 PM:

From: Krishna Kumar2 [EMAIL PROTECTED]
Date: Fri, 17 Aug 2007 11:36:03 +0530

I ran 3 iterations of 45 sec tests (total 1 hour 16 min, but I will run a longer one tonight).
The results are (results in KB/s, and %): I ran an 8.5-hour run with no batching + another 8.5-hour run with batching (buffer sizes: 32 128 512 4096 16384; threads: 1 8 32; each test run time: 3 minutes; iterations to average: 5). TCP seems to get a small improvement.

Using a 16K buffer size really isn't going to keep the pipe full enough for TSO. And realistically, applications queue much more data at a time. Also, smaller buffer sizes can have negative effects for the dynamic receive and send buffer growth algorithm the kernel uses; it might consider the connection application limited for too long.

I would really prefer to see numbers that use buffer sizes more in line with the amount of data that is typically in flight on a 1G connection on a local network. Do a tcpdump during the height of the transfer to see about what this value is. When an ACK comes in, compare the sequence number it's ACKing with the sequence number of the most recently sent frame. The difference is approximately the pipe size at maximum congestion window, assuming a loss-free local network.
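The pipe-size estimate Dave describes is simple arithmetic on the tcpdump output: highest sequence number sent minus the sequence number most recently ACKed. A minimal sketch (the helper name is hypothetical), using the numbers from KK's capture earlier in the thread and handling 32-bit sequence wraparound:

```python
def pipe_size(last_seq_sent: int, last_ack_received: int) -> int:
    """Approximate bytes in flight: the end of the most recently sent
    segment minus the sequence number the receiver last ACKed.
    TCP sequence numbers are 32-bit, so take the difference mod 2**32."""
    return (last_seq_sent - last_ack_received) % (1 << 32)

# From the capture: last data segment ended at 837194524,
# the receiver's ACK covered data up to 837165060.
print(pipe_size(837194524, 837165060))  # 29464 bytes in flight
```

This is only an approximation of the congestion window on a loss-free local network, as Dave notes; retransmissions or reordering would skew it.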
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: Krishna Kumar2 [EMAIL PROTECTED]
Date: Wed, 29 Aug 2007 08:53:30 +0530

I am scp'ing from 192.168.1.1 to 192.168.1.2 and captured at the send side.

Bad choice of test; this is cpu limited, since scp has to encrypt and MAC hash all the data it sends. Use something like straight ftp, or bw_tcp from lmbench.

Using a different tool seems strange to me. Why not just adjust the buffer size with command line options in the benchmark you were using in the first place?
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
[EMAIL PROTECTED] wrote on 08/29/2007 10:21:50 AM:

From: Krishna Kumar2 [EMAIL PROTECTED]
Date: Wed, 29 Aug 2007 08:53:30 +0530

I am scp'ing from 192.168.1.1 to 192.168.1.2 and captured at the send side.

Bad choice of test; this is cpu limited, since scp has to encrypt and MAC hash all the data it sends. Use something like straight ftp, or bw_tcp from lmbench.

OK.

Using a different tool seems strange to me. Why not just adjust the buffer size with command line options in the benchmark you were using in the first place?

The reason was to run parallel copies, not for buffer limitations. Let me use the same tool for the benchmark.

Thanks,
- KK
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: Krishna Kumar2 [EMAIL PROTECTED]
Date: Wed, 29 Aug 2007 10:43:23 +0530

The reason was to run parallel copies, not for buffer limitations.

Oh, I see. I'll note in passing that current lmbench-3 has some parallelization features you could play with; you might want to check it out.
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
On Sun, 2007-26-08 at 19:04 -0700, David Miller wrote:

The transfer is much better behaved if we ACK every two full sized frames we copy into the receiver, and therefore don't stretch ACK, but at the cost of cpu utilization.

The rx coalescing in theory should help by accumulating more ACKs on the rx side of the sender, but it doesn't seem to do that, i.e. for the 9K MTU, you are better off turning off the coalescing if you want higher numbers. Also some of the TOE vendors (chelsio?) claim to have fixed this by reducing bursts on outgoing packets.

Bill: who suggested (as per your email) the 75 usec value, and what was it based on measurement-wise?

BTW, thanks for finding the energy to run those tests, and a very refreshing perspective. I don't mean to add more work, but I had some queries: on your earlier tests, I think that Reno showed some significant differences on the lower MTU case over BIC. I wonder if this is consistent?

A side note: although the experimentation reduces the variables (eg tying all to CPU0), it would be more exciting to see the multi-cpu and multi-flow sender effect (which IMO is more real world).

Last note: you need a newer netstat.

These effects are particularly pronounced on systems where the bus bandwidth is also one of the limiting factors.

Can you elucidate this a little more, Dave? Did you mean memory bandwidth?

cheers,
jamal
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
On Fri, 24 Aug 2007, John Heffner wrote:

Bill Fink wrote:

Here you can see there is a major difference in the TX CPU utilization (99 % with TSO disabled versus only 39 % with TSO enabled), although the TSO disabled case was able to squeeze out a little extra performance from its extra CPU utilization. Interestingly, with TSO enabled, the receiver actually consumed more CPU than with TSO disabled, so I guess the receiver CPU saturation in that case (99 %) was what restricted its performance somewhat (this was consistent across a few test runs).

One possibility is that I think the receive-side processing tends to do better when receiving into an empty queue. When the (non-TSO) sender is the flow's bottleneck, this is going to be the case. But when you switch to TSO, the receiver becomes the bottleneck and you're always going to have to put the packets at the back of the receive queue. This might help account for the reason why you have both lower throughput and higher CPU utilization -- there's a point of instability right where the receiver becomes the bottleneck and you end up pushing it over to the bad side. :) Just a theory. I'm honestly surprised this effect would be so significant.

What do the numbers from netstat -s look like in the two cases?

Well, I was going to check this out, but I happened to reboot the system and now I get somewhat different results. Here are the new results, which should hopefully be more accurate since they are on a freshly booted system.
TSO enabled and GSO disabled:

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
 11610.6875 MB / 10.00 sec = 9735.9526 Mbps 100 %TX 75 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5029.6875 MB / 10.06 sec = 4194.6931 Mbps 36 %TX 100 %RX

TSO disabled and GSO disabled:

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
 11817.9375 MB / 10.00 sec = 9909.7773 Mbps 99 %TX 77 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5823.3125 MB / 10.00 sec = 4883.2429 Mbps 100 %TX 82 %RX

The TSO disabled case got a little better performance even for 9000 byte jumbo frames. For the -M1460 case emulating a standard 1500 byte Ethernet MTU, the performance was significantly better and used less CPU on the receiver (82 % versus 100 %), although it did use significantly more CPU on the transmitter (100 % versus 36 %).

TSO disabled and GSO enabled:

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
 11609.5625 MB / 10.00 sec = 9734.9859 Mbps 99 %TX 75 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5001.4375 MB / 10.06 sec = 4170.6739 Mbps 52 %TX 100 %RX

The GSO enabled case is very similar to the TSO enabled case, except that for the -M1460 test the transmitter used more CPU (52 % versus 36 %), which is to be expected since TSO has hardware assist.

Here's the before/after delta of the receiver's netstat -s statistics for the TSO enabled case:

Ip:
    3659898 total packets received
    3659898 incoming packets delivered
    80050 requests sent out
Tcp:
    2 passive connection openings
    3659897 segments received
    80050 segments send out
TcpExt:
    33 packets directly queued to recvmsg prequeue.
    104956 packets directly received from backlog
    705528 packets directly received from prequeue
    3654842 packets header predicted
    193 packets header predicted and directly queued to user
    4 acknowledgments not containing data received
    6 predicted acknowledgments

And here it is for the TSO disabled case (GSO also disabled):

Ip:
    4107083 total packets received
    4107083 incoming packets delivered
    1401376 requests sent out
Tcp:
    2 passive connection openings
    4107083 segments received
    1401376 segments send out
TcpExt:
    2 TCP sockets finished time wait in fast timer
    48486 packets directly queued to recvmsg prequeue.
    1056111048 packets directly received from backlog
    2273357712 packets directly received from prequeue
    1819317 packets header predicted
    2287497 packets header predicted and directly queued to user
    4 acknowledgments not containing data received
    10 predicted acknowledgments

For the TSO disabled case, there are many more TCP segments sent out (1401376 versus 80050), which I assume are ACKs, and which could possibly contribute to the higher throughput for the TSO disabled case due to faster feedback, but not explain the lower CPU utilization. There are many more packets directly queued to recvmsg prequeue (48486 versus 33). The numbers for packets directly received from backlog and prequeue in the TSO disabled case seem bogus to me, so I don't know how to interpret them. There are only about half as many packets header predicted (1819317 versus 3654842), but there are many more packets header predicted and directly queued to user (2287497 versus 193). I'll leave the analysis of all this to those who might actually know what it all means.

I also ran another set of tests that may be
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
Bill Fink wrote:

Here's the before/after delta of the receiver's netstat -s statistics for the TSO enabled case:

Ip:
    3659898 total packets received
    3659898 incoming packets delivered
    80050 requests sent out
Tcp:
    2 passive connection openings
    3659897 segments received
    80050 segments send out
TcpExt:
    33 packets directly queued to recvmsg prequeue.
    104956 packets directly received from backlog
    705528 packets directly received from prequeue
    3654842 packets header predicted
    193 packets header predicted and directly queued to user
    4 acknowledgments not containing data received
    6 predicted acknowledgments

And here it is for the TSO disabled case (GSO also disabled):

Ip:
    4107083 total packets received
    4107083 incoming packets delivered
    1401376 requests sent out
Tcp:
    2 passive connection openings
    4107083 segments received
    1401376 segments send out
TcpExt:
    2 TCP sockets finished time wait in fast timer
    48486 packets directly queued to recvmsg prequeue.
    1056111048 packets directly received from backlog
    2273357712 packets directly received from prequeue
    1819317 packets header predicted
    2287497 packets header predicted and directly queued to user
    4 acknowledgments not containing data received
    10 predicted acknowledgments

For the TSO disabled case, there are many more TCP segments sent out (1401376 versus 80050), which I assume are ACKs, and which could possibly contribute to the higher throughput for the TSO disabled case due to faster feedback, but not explain the lower CPU utilization. There are many more packets directly queued to recvmsg prequeue (48486 versus 33). The numbers for packets directly received from backlog and prequeue in the TSO disabled case seem bogus to me, so I don't know how to interpret them. There are only about half as many packets header predicted (1819317 versus 3654842), but there are many more packets header predicted and directly queued to user (2287497 versus 193). I'll leave the analysis of all this to those who might actually know what it all means.

There are a few interesting things here. For one, the bursts caused by TSO seem to be causing the receiver to do stretch ACKs. This may have a negative impact on flow performance, but it's hard to say for sure how much. Interestingly, it will even further reduce the CPU load on the sender, since it has to process fewer ACKs.

As I suspected, in the non-TSO case the receiver gets lots of packets directly queued to user. This should result in somewhat lower CPU utilization on the receiver. I don't know if it can account for all the difference you see.

The backlog and prequeue values are probably correct, but netstat's description is wrong. A quick look at the code reveals these values are in units of bytes, not packets.

-John
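The "before/after delta" technique Bill used can be automated by snapshotting `netstat -s` around a test run and subtracting the counters. A rough sketch under the assumption that counter lines have the common "<count> <description>" shape (section headers and other line shapes are ignored; the sample strings below are abbreviated, made-up snapshots):

```python
import re

def parse_counters(netstat_output: str) -> dict:
    """Map each 'N description' line of `netstat -s` output to {description: N}."""
    counters = {}
    for line in netstat_output.splitlines():
        m = re.match(r"\s*(\d+)\s+(.*\S)", line)
        if m:
            counters[m.group(2)] = int(m.group(1))
    return counters

def delta(before: str, after: str) -> dict:
    """Per-counter difference between two netstat -s snapshots."""
    b, a = parse_counters(before), parse_counters(after)
    return {k: a[k] - b.get(k, 0) for k in a if a[k] != b.get(k, 0)}

# Hypothetical snapshots; the differences match the TSO-disabled
# run quoted above (4107083 segments received, 1401376 sent out).
before = "Tcp:\n    1000 segments received\n    500 segments sent out\n"
after = "Tcp:\n    4108083 segments received\n    1401876 segments sent out\n"
print(delta(before, after))
# → {'segments received': 4107083, 'segments sent out': 1401376}
```

Counters that wrap or reset between snapshots would need extra care; this sketch assumes monotonically increasing values.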
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
On Thu, 2007-23-08 at 23:18 -0400, Bill Fink wrote:

[..] Here you can see there is a major difference in the TX CPU utilization (99 % with TSO disabled versus only 39 % with TSO enabled), although the TSO disabled case was able to squeeze out a little extra performance from its extra CPU utilization.

Good stuff. What kind of machine? SMP? Seems the receive side of the sender is also consuming a lot more cpu, I suspect because the receiver is generating a lot more ACKs with TSO. Does the choice of the tcp congestion control algorithm affect results? It would be interesting to see both MTUs with either TCP BIC vs good old Reno on the sender (probably without changing what the receiver does). BIC seems to be the default lately.

Interestingly, with TSO enabled, the receiver actually consumed more CPU than with TSO disabled,

I would suspect the fact that a lot more packets make it into the receiver with TSO contributes.

so I guess the receiver CPU saturation in that case (99 %) was what restricted its performance somewhat (this was consistent across a few test runs).

Unfortunately the receiver plays a big role in such tests - if it is bottlenecked then you are not really testing the limits of the transmitter.

cheers,
jamal
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
On Fri, 24 Aug 2007, jamal wrote:

On Thu, 2007-23-08 at 23:18 -0400, Bill Fink wrote:

[..] Here you can see there is a major difference in the TX CPU utilization (99 % with TSO disabled versus only 39 % with TSO enabled), although the TSO disabled case was able to squeeze out a little extra performance from its extra CPU utilization.

Good stuff. What kind of machine? SMP?

Tyan Thunder K8WE S2895ANRF motherboard with Nvidia nForce Professional 2200+2050 chipset, 2 AMD Opteron 254 2.8 GHz CPUs, 4 GB PC3200 ECC REG-DDR 400 memory, and 2 PCI-Express x16 slots (2 buses). It is SMP, but both the NIC interrupts and nuttcp are bound to CPU 0, and all other non-kernel system processes are bound to CPU 1.

Seems the receive side of the sender is also consuming a lot more cpu, I suspect because the receiver is generating a lot more ACKs with TSO.

Odd. I just reran the TCP CUBIC -M1460 tests, and with TSO enabled on the transmitter, there were about 153709 eth2 interrupts on the receiver, while with TSO disabled there was actually a somewhat higher number (164988) of receiver side eth2 interrupts, although the receive side CPU utilization was actually lower in that case. On the transmit side (different test run), the TSO enabled case had about 161773 eth2 interrupts whereas the TSO disabled case had about 165179 eth2 interrupts.

Does the choice of the tcp congestion control algorithm affect results? It would be interesting to see both MTUs with either TCP BIC vs good old Reno on the sender (probably without changing what the receiver does). BIC seems to be the default lately.

These tests were with the default TCP CUBIC (with initial_ssthresh set to 0).
With TCP BIC (and initial_ssthresh set to 0):

TSO enabled:

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
 11751.3750 MB / 10.00 sec = 9853.9839 Mbps 100 %TX 83 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4999.3321 MB / 10.06 sec = 4167.7872 Mbps 38 %TX 100 %RX

TSO disabled:

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
 11818.1875 MB / 10.00 sec = 9910.0682 Mbps 99 %TX 81 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5502.6250 MB / 10.00 sec = 4614.3297 Mbps 100 %TX 84 %RX

And with TCP Reno:

TSO enabled:

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
 11782.6250 MB / 10.00 sec = 9880.2613 Mbps 100 %TX 77 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5024.6649 MB / 10.06 sec = 4191.6574 Mbps 38 %TX 99 %RX

TSO disabled:

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
 11818.2500 MB / 10.00 sec = 9910.0860 Mbps 99 %TX 77 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5284. MB / 10.00 sec = 4430.9604 Mbps 99 %TX 79 %RX

Very similar results to the original TCP CUBIC tests.

Interestingly, with TSO enabled, the receiver actually consumed more CPU than with TSO disabled,

I would suspect the fact that a lot more packets make it into the receiver with TSO contributes.

so I guess the receiver CPU saturation in that case (99 %) was what restricted its performance somewhat (this was consistent across a few test runs).

Unfortunately the receiver plays a big role in such tests - if it is bottlenecked then you are not really testing the limits of the transmitter.

It might be interesting to see what effect the LRO changes would have on this. Once they are in a stable released kernel, I might try that out, or maybe even before if I get some spare time (but that's in very short supply right now).

-Thanks

-Bill
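For reference, switching between the congestion control variants compared above is a runtime knob on Linux. A sketch of the settings involved; the sysctl names are the standard ones, but the initial_ssthresh path assumes tcp_cubic is loaded as a module:

```shell
# Select the congestion control algorithm used for new connections:
sysctl -w net.ipv4.tcp_congestion_control=reno   # or cubic, bic

# See which algorithms are currently available to choose from:
sysctl net.ipv4.tcp_available_congestion_control

# Zero cubic's initial slow-start threshold cap, as in these tests
# (path assumes the tcp_cubic module is loaded):
echo 0 > /sys/module/tcp_cubic/parameters/initial_ssthresh
```

Algorithms not listed as available generally need their module loaded (e.g. modprobe tcp_bic) before they can be selected.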
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
Bill Fink wrote:

On Thu, 23 Aug 2007, Rick Jones wrote:

jamal wrote:

[TSO already passed - iirc, it has been demonstrated to really not add much to throughput (can't improve much over closeness to wire speed) but to improve CPU utilization].

In the one gig space sure, but in the 10 Gig space, TSO on/off does make a difference for throughput.

Not too much.

TSO enabled:

[EMAIL PROTECTED] ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
 11813.4375 MB / 10.00 sec = 9906.1644 Mbps 99 %TX 80 %RX

TSO disabled:

[EMAIL PROTECTED] ~]# ethtool -K eth2 tso off
[EMAIL PROTECTED] ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
 11818.2500 MB / 10.00 sec = 9910.0176 Mbps 100 %TX 78 %RX

Pretty negligible difference it seems.

Leaves one wondering how often more than one segment was sent to the card in the 9000 byte case :)

rick jones
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: jamal [EMAIL PROTECTED]
Date: Fri, 24 Aug 2007 08:14:16 -0400

Seems the receive side of the sender is also consuming a lot more cpu, I suspect because the receiver is generating a lot more ACKs with TSO.

I've seen this behavior before on a low cpu powered receiver, and the issue is that batching too much actually hurts a receiver. If the data packets were better spaced out, the receiver would handle the load better. This is the thing the TOE guys keep talking about overcoming with their packet pacing algorithms in their on-card TOE stack. My hunch is that even if in the non-TSO case the TX packets were all back to back in the card's TX ring, TSO still spits them out faster on the wire.
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
Bill Fink wrote:

Here you can see there is a major difference in the TX CPU utilization (99 % with TSO disabled versus only 39 % with TSO enabled), although the TSO disabled case was able to squeeze out a little extra performance from its extra CPU utilization. Interestingly, with TSO enabled, the receiver actually consumed more CPU than with TSO disabled, so I guess the receiver CPU saturation in that case (99 %) was what restricted its performance somewhat (this was consistent across a few test runs).

One possibility is that I think the receive-side processing tends to do better when receiving into an empty queue. When the (non-TSO) sender is the flow's bottleneck, this is going to be the case. But when you switch to TSO, the receiver becomes the bottleneck and you're always going to have to put the packets at the back of the receive queue. This might help account for the reason why you have both lower throughput and higher CPU utilization -- there's a point of instability right where the receiver becomes the bottleneck and you end up pushing it over to the bad side. :) Just a theory. I'm honestly surprised this effect would be so significant.

What do the numbers from netstat -s look like in the two cases?

-John
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
On Wed, 2007-22-08 at 13:21 -0700, David Miller wrote:

From: Rick Jones [EMAIL PROTECTED]
Date: Wed, 22 Aug 2007 10:09:37 -0700

Should it be any more or less worrisome than small packet performance (eg the TCP_RR stuff I posted recently) being rather worse with TSO enabled than with it disabled?

That, like any such thing shown by the batching changes, is a bug to fix.

Possibly a bug - but you really should turn off TSO if you are doing huge interactive transactions (which is fair because there is a clear demarcation). The litmus test is the same as for any change that is supposed to improve net performance - it has to demonstrate that it is not intrusive and that it improves (consistently) performance. The standard metrics are {throughput, cpu-utilization, latency}, i.e. as long as one improves and the others remain unchanged, it would make sense. Yes, I am religious about batching after all the invested sweat (and I continue to work on it hoping to demystify it) - the theory makes a lot of sense.

cheers,
jamal
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
On Thu, 2007-23-08 at 18:04 -0400, jamal wrote:

The litmus test is the same as for any change that is supposed to improve net performance - it has to demonstrate that it is not intrusive and that it improves (consistently) performance. The standard metrics are {throughput, cpu-utilization, latency}, i.e. as long as one improves and the others remain unchanged, it would make sense. Yes, I am religious about batching after all the invested sweat (and I continue to work on it hoping to demystify it) - the theory makes a lot of sense.

Before someone jumps and strangles me ;- By litmus test I meant as applied to batching. [TSO already passed - iirc, it has been demonstrated to really not add much to throughput (can't improve much over closeness to wire speed) but to improve CPU utilization].

cheers,
jamal
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: jamal [EMAIL PROTECTED] Date: Thu, 23 Aug 2007 18:04:10 -0400
Possibly a bug - but you really should turn off TSO if you are doing huge interactive transactions (which is fair because there is a clear demarcation).

I don't see how this can matter. TSO only ever does anything if you accumulate more than one MSS worth of data. And when that does happen, all it does is take what's in the send queue and send as much as possible at once. The packets are already built in big chunks, so there is no extra work to do. The card is going to send the things back to back and as fast as in the non-TSO case as well. It doesn't change application scheduling, and it absolutely does not penalize small sends by the application unless we have a bug somewhere.

So I see no reason to disable TSO for any reason other than hardware implementation deficiencies. And the drivers I am familiar with do make smart default TSO enabling decisions based upon how well the chip does TSO.
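The point above can be illustrated with a toy model (a sketch only, not the kernel's actual TSO code path; the 1460-byte MSS is an assumed typical Ethernet value, and real behavior also depends on cwnd and receiver window):

```python
MSS = 1460  # assumed typical Ethernet MSS

def wire_frames(bytes_queued, mss=MSS):
    """Model what a TSO-capable path emits: anything up to one MSS goes
    out as a single frame either way, so small interactive sends are
    unaffected; only larger sends get sliced into MSS-sized frames by
    the NIC."""
    if bytes_queued <= mss:
        return [bytes_queued] if bytes_queued else []
    full, rest = divmod(bytes_queued, mss)
    return [mss] * full + ([rest] if rest else [])

# A 500-byte interactive send is one frame with or without TSO; a
# 10*MSS send is handed to the hardware as one big skb but still
# appears as 10 back-to-back frames on the wire.
```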
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
jamal wrote:
[TSO already passed - IIRC, it has been demonstrated to not really add much to throughput (you can't improve much on something already close to wire speed) but to improve CPU utilization.]

In the one gig space sure, but in the 10 Gig space, TSO on/off does make a difference for throughput.

rick jones
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
On Thu, 2007-23-08 at 15:30 -0700, David Miller wrote:
From: jamal [EMAIL PROTECTED] Date: Thu, 23 Aug 2007 18:04:10 -0400
Possibly a bug - but you really should turn off TSO if you are doing huge interactive transactions (which is fair because there is a clear demarcation).
I don't see how this can matter. TSO only ever does anything if you accumulate more than one MSS worth of data.

I stand corrected then.

cheers, jamal
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
On Thu, 2007-23-08 at 15:35 -0700, Rick Jones wrote:
jamal wrote: [TSO already passed - IIRC, it has been demonstrated to not really add much to throughput (you can't improve much on something already close to wire speed) but to improve CPU utilization.]
In the one gig space sure, but in the 10 Gig space, TSO on/off does make a difference for throughput.

I am still so 1GigE ;- I stand corrected again ;-

cheers, jamal
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
On Thu, 23 Aug 2007, Rick Jones wrote:
jamal wrote: [TSO already passed - IIRC, it has been demonstrated to not really add much to throughput (you can't improve much on something already close to wire speed) but to improve CPU utilization.]
In the one gig space sure, but in the 10 Gig space, TSO on/off does make a difference for throughput.

Not too much.

TSO enabled:

[EMAIL PROTECTED] ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
11813.4375 MB / 10.00 sec = 9906.1644 Mbps 99 %TX 80 %RX

TSO disabled:

[EMAIL PROTECTED] ~]# ethtool -K eth2 tso off
[EMAIL PROTECTED] ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
11818.2500 MB / 10.00 sec = 9910.0176 Mbps 100 %TX 78 %RX

Pretty negligible difference it seems. This is with a 2.6.20.7 kernel, Myricom 10-GigE NICs, and 9000 byte jumbo frames, in a LAN environment. For grins, I also did a couple of tests with an MSS of 1460 to emulate a standard 1500 byte Ethernet MTU.
TSO enabled:

[EMAIL PROTECTED] ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
5102.8503 MB / 10.06 sec = 4253.9124 Mbps 39 %TX 99 %RX

TSO disabled:

[EMAIL PROTECTED] ~]# ethtool -K eth2 tso off
[EMAIL PROTECTED] ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
5399.5625 MB / 10.00 sec = 4527.9070 Mbps 99 %TX 76 %RX

Here you can see there is a major difference in TX CPU utilization (99 % with TSO disabled versus only 39 % with TSO enabled), although the TSO disabled case was able to squeeze out a little extra performance from its extra CPU utilization. Interestingly, with TSO enabled, the receiver actually consumed more CPU than with TSO disabled, so I guess the receiver CPU saturation in that case (99 %) was what restricted its performance somewhat (this was consistent across a few test runs).

-Bill
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
On Thu, 23 Aug 2007 18:38:22 -0400 jamal [EMAIL PROTECTED] wrote:
On Thu, 2007-23-08 at 15:30 -0700, David Miller wrote:
From: jamal [EMAIL PROTECTED] Date: Thu, 23 Aug 2007 18:04:10 -0400
Possibly a bug - but you really should turn off TSO if you are doing huge interactive transactions (which is fair because there is a clear demarcation).
I don't see how this can matter. TSO only ever does anything if you accumulate more than one MSS worth of data.
I stand corrected then. cheers, jamal

For most normal Internet TCP connections, you will see only 2 or 3 packets per TSO because of ACK clocking. If you turn off delayed ACK on the receiver it will be even less. A current hot topic of research is reducing the number of ACKs to make TCP work better over asymmetric links like 3G.

-- Stephen Hemminger [EMAIL PROTECTED]
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
Hi Dave,

David Miller [EMAIL PROTECTED] wrote on 08/22/2007 09:52:29 AM:
From: Krishna Kumar2 [EMAIL PROTECTED] Date: Wed, 22 Aug 2007 09:41:52 +0530
snip
Because TSO does batching already, so it's a very good tit for tat comparison of the new batching scheme vs. an existing one.
I am planning to do more testing on your suggestion over the weekend, but I had a comment. Are you saying that TSO and batching should be mutually exclusive, so that only hardware that doesn't support TSO (like IB) would benefit? But even if they can co-exist, aren't cases like sending multiple small skbs better handled with batching?
I'm not making any suggestions, so don't read that into anything I've said :-) I think the jury is still out, but seeing TSO perform even slightly worse with the batching changes in place would be very worrisome. This applies to both throughput and cpu utilization.

Does turning off batching solve that problem? What I mean by that is: batching can be disabled if a TSO device is worse for some cases. In fact, something I changed in my latest code is to not enable batching in register_netdevice (in Rev4, which I am sending in a few mins); rather, the user has to explicitly turn batching 'on'. Wondering if that is what you are concerned about. In any case, I will test your case on Monday (I am on vacation for the next couple of days).

Thanks, - KK
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: Krishna Kumar2 [EMAIL PROTECTED] Date: Wed, 22 Aug 2007 12:33:04 +0530
Does turning off batching solve that problem? What I mean by that is: batching can be disabled if a TSO device is worse for some cases.

This new batching stuff isn't going to be enabled or disabled on a per-device basis just to get parity with how things are now. It should be enabled by default, and give at least as good performance as what can be obtained right now. Otherwise it's a clear regression.
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
David Miller wrote:
I think the jury is still out, but seeing TSO perform even slightly worse with the batching changes in place would be very worrisome. This applies to both throughput and cpu utilization.

Should it be any more or less worrisome than small packet performance (e.g. the TCP_RR stuff I posted recently) being rather worse with TSO enabled than with it disabled?

rick jones
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: Rick Jones [EMAIL PROTECTED] Date: Wed, 22 Aug 2007 10:09:37 -0700
Should it be any more or less worrisome than small packet performance (e.g. the TCP_RR stuff I posted recently) being rather worse with TSO enabled than with it disabled?

That, like any such thing shown by the batching changes, is a bug to fix.
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
David Miller [EMAIL PROTECTED] wrote on 08/22/2007 02:44:40 PM:
From: Krishna Kumar2 [EMAIL PROTECTED] Date: Wed, 22 Aug 2007 12:33:04 +0530
Does turning off batching solve that problem? What I mean by that is: batching can be disabled if a TSO device is worse for some cases.
This new batching stuff isn't going to be enabled or disabled on a per-device basis just to get parity with how things are now. It should be enabled by default, and give at least as good performance as what can be obtained right now.

That was how it was in earlier revisions. In revision4 I coded it so that it is enabled only if explicitly set by the user. I can revert that change.

Otherwise it's a clear regression.

Definitely. For drivers that support it, it should not reduce performance.

Thanks, - KK
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: Krishna Kumar2 [EMAIL PROTECTED] Date: Fri, 17 Aug 2007 11:36:03 +0530
I ran 3 iterations of 45 sec tests (total 1 hour 16 min, but I will run a longer one tonight). The results are (results in KB/s, and %): I ran an 8.5 hours run with no batching + another 8.5 hours run with batching (Buffer sizes: 32 128 512 4096 16384, Threads: 1 8 32, Each test run time: 3 minutes, Iterations to average: 5). TCP seems to get a small improvement.

Using a 16K buffer size really isn't going to keep the pipe full enough for TSO. And realistically, applications queue much more data at a time. Also, smaller buffer sizes can have negative effects on the dynamic receive and send buffer growth algorithm the kernel uses; it might consider the connection application-limited for too long.

I would really prefer to see numbers that use buffer sizes more in line with the amount of data that is typically in flight on a 1G connection on a local network. Do a tcpdump during the height of the transfer to see about what this value is. When an ACK comes in, compare the sequence number it's ACKing with the sequence number of the most recently sent frame. The difference is approximately the pipe size at maximum congestion window, assuming a loss-free local network.
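The arithmetic Dave describes can be sketched as follows (a hypothetical helper, not a real tool; the numbers in the example are made up, and modular subtraction is needed because TCP sequence numbers are 32-bit and wrap):

```python
SEQ_SPACE = 1 << 32  # TCP sequence numbers are 32-bit and wrap around

def inflight_bytes(last_sent_seq, acked_seq):
    """Approximate pipe size from a tcpdump of the transfer: bytes sent
    but not yet ACKed, i.e. the end of the most recently sent frame
    minus the sequence number the incoming ACK covers."""
    return (last_sent_seq - acked_seq) % SEQ_SPACE

# e.g. if the most recently sent frame ended at seq 5_000_000 and the
# incoming ACK covers 3_800_000, roughly 1.2 MB is in flight - that is
# the buffer size worth feeding the benchmark.
```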
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:
Using 16K buffer size really isn't going to keep the pipe full enough for TSO.

Why the comparison with TSO (or GSO for that matter)? Seems to me that is only valid/fair if you have a single flow. Batching is multi-flow focused (or I should say flow-unaware).

cheers, jamal
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: jamal [EMAIL PROTECTED] Date: Tue, 21 Aug 2007 08:30:22 -0400
On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:
Using 16K buffer size really isn't going to keep the pipe full enough for TSO.
Why the comparison with TSO (or GSO for that matter)?

Because TSO does batching already, so it's a very good tit for tat comparison of the new batching scheme vs. an existing one.
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
On Tue, 2007-21-08 at 11:51 -0700, David Miller wrote:
Because TSO does batching already, so it's a very good tit for tat comparison of the new batching scheme vs. an existing one.

Fair enough - I may have read too much into your email then ;- For bulk type of apps (where TSO will make a difference) this is a fair test. Hence I agree the 16KB buffer size is not sensible if the goal is to simulate such an app. However (and this is where I read too much into what you were saying), the test by itself is an insufficient comparison. You gotta look at the other side of the coin, i.e. at apps where TSO won't buy much. Examples: a busy ssh or irc server, and you could go as far as looking at the most predominant app in the wild west, http (average page size from a few years back was in the range of 10-20K and can be simulated with good ole netperf/iperf).

cheers, jamal
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
David Miller [EMAIL PROTECTED] wrote on 08/22/2007 12:21:43 AM:
From: jamal [EMAIL PROTECTED] Date: Tue, 21 Aug 2007 08:30:22 -0400
On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:
Using 16K buffer size really isn't going to keep the pipe full enough for TSO.
Why the comparison with TSO (or GSO for that matter)?
Because TSO does batching already, so it's a very good tit for tat comparison of the new batching scheme vs. an existing one.

I am planning to do more testing on your suggestion over the weekend, but I had a comment. Are you saying that TSO and batching should be mutually exclusive so hardware that doesn't support TSO (like IB) only would benefit? But even if they can co-exist, aren't cases like sending multiple small skbs better handled with batching?

Thanks, - KK
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: Krishna Kumar2 [EMAIL PROTECTED] Date: Wed, 22 Aug 2007 09:41:52 +0530
David Miller [EMAIL PROTECTED] wrote on 08/22/2007 12:21:43 AM:
From: jamal [EMAIL PROTECTED] Date: Tue, 21 Aug 2007 08:30:22 -0400
On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:
Using 16K buffer size really isn't going to keep the pipe full enough for TSO.
Why the comparison with TSO (or GSO for that matter)?
Because TSO does batching already, so it's a very good tit for tat comparison of the new batching scheme vs. an existing one.
I am planning to do more testing on your suggestion over the weekend, but I had a comment. Are you saying that TSO and batching should be mutually exclusive so hardware that doesn't support TSO (like IB) only would benefit? But even if they can co-exist, aren't cases like sending multiple small skbs better handled with batching?

I'm not making any suggestions, so don't read that into anything I've said :-) I think the jury is still out, but seeing TSO perform even slightly worse with the batching changes in place would be very worrisome. This applies to both throughput and cpu utilization.
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
Hi Dave,

I ran 3 iterations of 45 sec tests (total 1 hour 16 min, but I will run a longer one tonight). The results are (results in KB/s, and %): I ran an 8.5 hours run with no batching + another 8.5 hours run with batching (Buffer sizes: 32 128 512 4096 16384, Threads: 1 8 32, Each test run time: 3 minutes, Iterations to average: 5). TCP seems to get a small improvement.

Thanks, - KK

--- TCP ---
Size:32    Procs:1      3415      3321    -2.75
Size:128   Procs:1     13094     13388     2.24
Size:512   Procs:1     49037     50683     3.35
Size:4096  Procs:1    114646    114619     -.02
Size:16384 Procs:1    114626    114644      .01
Size:32    Procs:8     22675     22633     -.18
Size:128   Procs:8     77994     77297     -.89
Size:512   Procs:8    114716    114711        0
Size:4096  Procs:8    114637    114636        0
Size:16384 Procs:8     95814    114638    19.64
Size:32    Procs:32    23240     23349      .46
Size:128   Procs:32    82284     82247     -.04
Size:512   Procs:32   114885    114769     -.10
Size:4096  Procs:32    95735    114634    19.74
Size:16384 Procs:32   114736    114641     -.08
Average:             1151534   1190210    3.36%

--- No Delay: ---
Size:32    Procs:1      3002      2873    -4.29
Size:128   Procs:1     11853     11801     -.43
Size:512   Procs:1     45565     45837      .59
Size:4096  Procs:1    114511    114485     -.02
Size:16384 Procs:1    114521    114555      .02
Size:32    Procs:8      8026      8029      .03
Size:128   Procs:8     31589     31573     -.05
Size:512   Procs:8    111506    105766    -5.14
Size:4096  Procs:8    114455    114454        0
Size:16384 Procs:8     95833    114491    19.46
Size:32    Procs:32     8005      8027      .27
Size:128   Procs:32    31475     31505      .09
Size:512   Procs:32   114558    113687     -.76
Size:4096  Procs:32   114784    114447     -.29
Size:16384 Procs:32   114719    114496     -.19
Average:             1046026   1034402   -1.11%
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
Hi Dave,

David Miller [EMAIL PROTECTED] wrote on 08/08/2007 04:19:00 PM:
From: Krishna Kumar [EMAIL PROTECTED] Date: Wed, 08 Aug 2007 15:01:14 +0530
RESULTS: The performance improvement for TCP No Delay is in the range of -8% to 320% (with -8% being the sole negative), with many individual tests giving 50% or more improvement (I think it is to do with the hw slots getting full quicker resulting in more batching when the queue gets woken). The results for TCP are in the range of -11% to 93%, with most of the tests (8/12) giving improvements.
Not because I think it obviates your work, but rather because I'm curious, could you test a TSO-in-hardware driver converted to batching and see how TSO alone compares to batching for a pure TCP workload? I personally don't think it will help for that case at all, as TSO likely does a better job of coalescing the work _and_ reducing bus traffic as well as work in the TCP stack.

I used E1000 (guess the choice is OK, as e1000_tso returns TRUE; my hw is 82547GI). You are right, it doesn't help the TSO case at all (in fact it degrades). Two things to note though:

- E1000 may not be suitable for adding batching (which is no longer a new API, as I have changed it already).
- Small skbs, where TSO doesn't come into the picture, still seem to improve. A couple of cases for large skbs did result in some improvement (like 4K, TCP No Delay, 32 procs).

[Total segments retransmission for original code test run: 2220; for new code test run: 1620. So the retransmission problem that I was getting seems to be an IPoIB bug, though I did have to fix one bug in my networking component where I was calling qdisc_run(NULL) for the regular xmit path, and change to always use batching.
The problem is that skb1 - skb10 may be present in the queue after each of them failed to be sent out; then net_tx_action fires, which batches all of these into the blist and tries to send them out again, which also fails (e.g. tx lock fail or queue full); then the next single skb xmit will send the latest skb, ignoring the 10 skbs that are already waiting in the batching list. These 10 skbs are sent out only the next time net_tx_action is called, so out of order skbs result. This fix reduced retransmissions from 180,000 to 55,000 or so. When I changed the IPoIB driver to use iterative sends of each skb instead of creating multiple Work Requests, that number went down to 15].

I ran 3 iterations of 45 sec tests (total 1 hour 16 min, but I will run a longer one tonight). The results are (results in KB/s, and %):

Test Case             Org BW    New BW    % Change

TCP
Size:32    Procs:1      1848      3918    112.01
Size:32    Procs:8     21888     21555     -1.52
Size:32    Procs:32    19317     22433     16.13
Size:256   Procs:1     15584     25991     66.78
Size:256   Procs:8    110937     74565    -32.78
Size:256   Procs:32   105767     98967     -6.42
Size:4096  Procs:1     81910     96073     17.29
Size:4096  Procs:8    113302     94040    -17.00
Size:4096  Procs:32   109664    105522     -3.77

TCP No Delay:
Size:32    Procs:1      2688      3177     18.19
Size:32    Procs:8      6568     10588     61.20
Size:32    Procs:32     6573      7838     19.24
Size:256   Procs:1      7869     12724     61.69
Size:256   Procs:8     65652     45652    -30.46
Size:256   Procs:32    95114    112279     18.04
Size:4096  Procs:1     95302     84664    -11.16
Size:4096  Procs:8        19     89111    -19.80
Size:4096  Procs:32   109249    113919      4.27

I will submit Rev4 with suggested changes (including single merged API) on Thursday after some more testing.

Thanks, - KK
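The reordering bug described above can be illustrated with a toy model (hypothetical names, not the actual net/sched code): the fix amounts to always queuing a new skb behind any previously deferred ones and draining strictly from the head, so a fresh skb can never overtake skbs still sitting in the batch list.

```python
from collections import deque

class ToyDev:
    """Stand-in for a driver whose tx ring may refuse skbs."""
    def __init__(self):
        self.wire = []       # skbs actually transmitted, in order
        self.busy = False    # simulates tx lock failure / queue full

    def hard_xmit(self, skb):
        if self.busy:
            return False     # driver couldn't take it; caller must defer
        self.wire.append(skb)
        return True

def xmit(dev, pending, skb):
    # Fixed path: enqueue first, then drain from the head, so earlier
    # deferred skbs always go out before the newest one.
    pending.append(skb)
    while pending and dev.hard_xmit(pending[0]):
        pending.popleft()

dev, pending = ToyDev(), deque()
dev.busy = True
for n in (1, 2, 3):
    xmit(dev, pending, n)    # all deferred while the ring is full
dev.busy = False
xmit(dev, pending, 4)        # 4 must NOT jump ahead of 1..3
```

The buggy behavior corresponds to calling `dev.hard_xmit(skb)` directly for the fresh skb: it would put 4 on the wire before 1-3, producing exactly the out-of-order stream (and retransmissions) described.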
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
Forgot to mention one thing:

This fix reduced retransmissions from 180,000 to 55,000 or so. When I changed the IPoIB driver to use iterative sends of each skb instead of creating multiple Work Requests, that number went down to 15.

This also reduced TCP No Delay performance from huge percentages like 200-400% down to almost the same as the original code. So fixing this problem in IPoIB (driver?) will enable use of multiple Work Request / Work Completion, rather than limiting batching to a single WR/WC.

thanks, - KK
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
David Miller [EMAIL PROTECTED] wrote on 08/09/2007 09:57:27 AM:
Patrick had suggested calling dev_hard_start_xmit() instead of conditionally calling the new API, and to remove the new API entirely. The driver determines whether batching is required or not depending on (skb==NULL) or not. Would that approach be fine with this single interface goal?
It is a valid possibility. Note that this is similar to how we handle TSO; the driver sets the feature bit and in its ->hard_start_xmit() it checks the SKB for the given offload property.

Great, I will try to get rid of the two paths entirely, and see how to re-arrange the code cleanly.

thanks, - KK
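The convention being discussed can be sketched like this (made-up names, mirroring how a driver tests a feature on each skb; not the real patch): a single ->hard_start_xmit() entry point, where a NULL skb means "drain the device's batch list" and a non-NULL skb is the ordinary single-packet path.

```python
class BatchingDriver:
    def __init__(self):
        self.blist = []   # skbs the stack queued for batching
        self.txed = []    # what the simulated hardware transmitted

    def hard_start_xmit(self, skb):
        # One entry point for both paths: skb is None signals
        # "send everything on blist"; otherwise send just this skb.
        batch = self.blist if skb is None else [skb]
        for s in batch:
            self.txed.append(s)
        if skb is None:
            self.blist.clear()

drv = BatchingDriver()
drv.hard_start_xmit("skb-a")          # ordinary single-skb xmit
drv.blist.extend(["skb-b", "skb-c"])  # the stack batched two more
drv.hard_start_xmit(None)             # NULL skb drains the batch
```

The appeal of this shape is that core code always calls the same method, just as it does for TSO, so no second transmit interface is needed.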
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: Krishna Kumar [EMAIL PROTECTED] Date: Wed, 08 Aug 2007 15:01:14 +0530
RESULTS: The performance improvement for TCP No Delay is in the range of -8% to 320% (with -8% being the sole negative), with many individual tests giving 50% or more improvement (I think it is to do with the hw slots getting full quicker resulting in more batching when the queue gets woken). The results for TCP are in the range of -11% to 93%, with most of the tests (8/12) giving improvements.

Not because I think it obviates your work, but rather because I'm curious, could you test a TSO-in-hardware driver converted to batching and see how TSO alone compares to batching for a pure TCP workload?

I personally don't think it will help for that case at all, as TSO likely does a better job of coalescing the work _and_ reducing bus traffic as well as work in the TCP stack.
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
On Wed, 2007-08-08 at 21:42 +0800, Herbert Xu wrote:
On Wed, Aug 08, 2007 at 03:49:00AM -0700, David Miller wrote:
Not because I think it obviates your work, but rather because I'm curious, could you test a TSO-in-hardware driver converted to batching and see how TSO alone compares to batching for a pure TCP workload?
You could even lower the bar by disabling TSO and enabling software GSO.

From my observation, for TCP packets slightly above MTU (up to 2K), GSO gives worse performance than non-GSO throughput-wise. Actually this has nothing to do with batching; the behavior is consistent with or without the batching changes.

I personally don't think it will help for that case at all, as TSO likely does a better job of coalescing the work _and_ reducing bus traffic as well as work in the TCP stack.
I agree. I suspect the bulk of the effort is in getting these skbs created and processed by the stack, so that by the time they're exiting the qdisc there's not much to be saved anymore.

pktgen shows a clear win if you test the driver path - which is what you should test, because that's where the batching changes are. Using TCP or UDP adds other variables[1] that need to be isolated first in order to quantify the effect of batching. For throughput and CPU utilization, the benefit will be clear when there are a lot more flows.

cheers, jamal

[1] I think there are too many other variables in play, unfortunately, when you are dealing with a path that starts above the driver and one that covers end to end effects: traffic/app source, system clock sources as per my recent discovery, congestion control algorithms used, tuning of the receiver, etc.
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: Krishna Kumar2 [EMAIL PROTECTED] Date: Wed, 8 Aug 2007 16:39:47 +0530
What do you generally think of the patch/implementation? :)

We have two driver implementation paths on receive and now we'll have two on send, and that's not a good trend. In an ideal world all the drivers would be NAPI, and netif_rx() would only be used by tunneling drivers and similar in the protocol layers. And likewise all sends would go through ->hard_start_xmit().

If you can come up with a long term strategy that gets rid of the special transmit method, that'd be great. We should make Linux network drivers easy to write, not more difficult by constantly adding more interfaces than we consolidate.
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: jamal [EMAIL PROTECTED]
Date: Wed, 08 Aug 2007 11:14:35 -0400

> pktgen shows a clear win if you test the driver path - which is what
> you should test because that's where the batching changes are.

The driver path, however, does not exist on an island, and what we care
about is the final result with the changes running inside the full
system.

So, to be honest, beyond initial internal development feedback, the
isolated tests have only minimal merit, and it's the full protocol
tests that are really interesting.
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
Hello Herbert,

> Not because I think it obviates your work, but rather because I'm
> curious, could you test a TSO-in-hardware driver converted to
> batching and see how TSO alone compares to batching for a pure TCP
> workload? You could even lower the bar by disabling TSO and enabling
> software GSO.

We had a discussion before. GSO doesn't benefit a device which has no
HW checksum (like IPoIB), since it induces an extra copy. And GSO
benefits one stream, while batching benefits multiple streams.

Thanks
Shirley
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
Herbert Xu [EMAIL PROTECTED] wrote on 08/08/2007 07:12:47 PM:

> On Wed, Aug 08, 2007 at 03:49:00AM -0700, David Miller wrote:
> > Not because I think it obviates your work, but rather because I'm
> > curious, could you test a TSO-in-hardware driver converted to
> > batching and see how TSO alone compares to batching for a pure TCP
> > workload? You could even lower the bar by disabling TSO and
> > enabling software GSO.

I will try with E1000 (though I didn't see improvement when I tested a
long time back). The difference I expect is that TSO would help with
large packets, not necessarily with small/medium packets, and
definitely not in the case of multiple different skbs (as opposed to a
single large skb) getting queued. I think these are two different
workloads.

> > I personally don't think it will help for that case at all, as TSO
> > likely does a better job of coalescing the work _and_ reducing bus
> > traffic, as well as work in the TCP stack.
>
> I agree. I suspect the bulk of the effort is in getting these skb's
> created and processed by the stack, so that by the time they're
> exiting the qdisc there's not much left to be saved.

However, I am getting a large improvement for IPoIB specifically for
this same case. The reason: batching helps only when the queue gets
full and stopped (and, to a lesser extent, when the tx lock was not
acquired, which results in less batching being possible).

thanks,

- KK
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
Hi Dave,

David Miller [EMAIL PROTECTED] wrote on 08/09/2007 03:31:37 AM:

> > What do you generally think of the patch/implementation ? :)
>
> We have two driver implementation paths on receive, and now we'll
> have two on send, and that's not a good trend.

Correct.

> In an ideal world all the drivers would be NAPI, and netif_rx() would
> only be used by tunneling drivers and similar in the protocol layers.
> And likewise all sends would go through ->hard_start_xmit().
>
> If you can come up with a long-term strategy that gets rid of the
> special transmit method, that'd be great.
>
> We should make Linux network drivers easy to write, not more
> difficult by constantly adding more interfaces than we consolidate.

I think that is a good top-level view, and I agree with it.

Patrick had suggested calling dev_hard_start_xmit() instead of
conditionally calling the new API, and removing the new API entirely.
The driver determines whether batching is required or not depending on
whether skb == NULL. Would that approach be fine with this
single-interface goal?

Thanks,

- KK
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: Krishna Kumar2 [EMAIL PROTECTED]
Date: Thu, 9 Aug 2007 09:49:57 +0530

> Patrick had suggested calling dev_hard_start_xmit() instead of
> conditionally calling the new API, and removing the new API entirely.
> The driver determines whether batching is required or not depending
> on whether skb == NULL. Would that approach be fine with this
> single-interface goal?

It is a valid possibility. Note that this is similar to how we handle
TSO: the driver sets the feature bit, and in its ->hard_start_xmit()
it checks the SKB for the given offload property.