Hi Craig,

Thank you for pulling the data together on a website. I believe the results are quite interesting. It is probably worthwhile to point out a few performance points:
Sockets over TOE:
- reaches line rate for small IO sizes (<1KB), hitting 1/2 line rate just north of 256B
- CPU utilization drops to about 25% for receive and about 12.5% for transmit, out of a single core; various folks would probably report this as 8% and 3% when considering the processing power of the entire machine
- 1B latency is about 10 usecs

Sockets over SDP:
- reaches line rate only at IO sizes of about 16KB (ZCOPY disabled) and 64KB (ZCOPY enabled)
- CPU utilization is about 100%, even for large IO, and the benefit of ZCOPY is limited (about 12.5%)
- 1B latency is about 20 usecs

You can make the same comparison for Sockets over NIC as well.

I believe these numbers show the benefit of running sockets apps directly over the T3 TOE interface (instead of mapping a TCP streaming interface onto an RDMA interface and then eventually back to a TCP stream :), which is very efficient. A lot of folks believe that TOE provides little benefit, and even less benefit for small IO (which is so crucial for many apps), but these results really prove them wrong.

Note that the NIC requires an IO size of 4KB to reach line rate, and that performance falls off again as the IO size increases (beyond CPU cache sizes). This is all the more notable because you use an MTU of 9KB (jumbo frames); the NIC vs. TOE comparison would tip in the TOE's favor even faster if you were to run with MTU 1500. (A toy send-size sweep, sketched further below, is all it takes to see this IO-size dependence.)

One small correction with respect to T3 and the DMA address range (for iWARP): T3 does not have any address limitation and can DMA to/from any 64b address. However, memory region sizes are limited to 4GB. OFED currently attempts to map the entire address space for DMA (which, IMHO, is questionable, since the entire address space is opened up for DMA - what about UNIX security semantics? :-/). It would probably be better (more secure) if apps only registered the address ranges that they really want to DMA to/from; a 4GB region size limit then seems entirely adequate. A minimal registration sketch along those lines is below.
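
To make that concrete, here is a minimal sketch of the "register only what you really DMA to/from" approach using libibverbs. This is not the OFED/SDP code path itself; the buffer size, device choice, and access flags are just illustrative assumptions, and error handling is cut to the bare minimum:

/* Register a single 1MB buffer instead of opening up the whole address
 * space.  With per-buffer regions well under 4GB, T3's region size limit
 * never gets in the way. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    size_t len = 1UL << 20;                 /* the 1MB we actually DMA to/from */
    void *buf = malloc(len);

    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!buf || !devs || num == 0)
        return 1;

    struct ibv_context *ctx = ibv_open_device(devs[0]);   /* first HCA/RNIC */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
        return 1;
    }
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
           len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}

Only memory registered this way is exposed for DMA, which is the UNIX-friendly behaviour I am arguing for above.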
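
And, purely for illustration, this is the kind of toy send-size sweep I had in mind for the IO-size points above. The peer address, port, iteration count, and sweep range are all made up; a real measurement would of course use netperf/netserver as Craig does, and the same binary can be pointed at SDP by preloading libsdp:

/* Time send() over a plain TCP stream for message sizes from 1KB to 4MB. */
#include <arpa/inet.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_in peer = { .sin_family = AF_INET,
                                .sin_port   = htons(12866) };   /* assumed port */
    inet_pton(AF_INET, "192.168.0.2", &peer.sin_addr);          /* assumed peer */

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0 || connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect");
        return 1;
    }

    char *buf = calloc(1, 4 << 20);                             /* up to 4MB messages */
    for (size_t len = 1024; len <= (4u << 20); len <<= 1) {
        struct timeval t0, t1;
        long long sent = 0;

        gettimeofday(&t0, NULL);
        for (int i = 0; i < 1000; i++) {                        /* 1000 sends per size */
            ssize_t n = send(fd, buf, len, 0);
            if (n < 0) { perror("send"); return 1; }
            sent += n;
        }
        gettimeofday(&t1, NULL);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%8zu bytes: %.1f Mbit/s\n", len, sent * 8 / secs / 1e6);
    }

    close(fd);
    free(buf);
    return 0;
}

Plotting Mbit/s against message size from a loop like this is what produces the line-rate-vs-IO-size behaviour discussed above.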
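
Finally, since the -T pinning Craig describes below made such a difference: this is roughly what that option does under the hood, sketched with sched_setaffinity(). The CPU numbers are just the ones from the quoted runs; steering the ib_mthca interrupts onto another core is done separately, e.g. by writing a mask to /proc/irq/<N>/smp_affinity:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                  /* keep the benchmark on CPU 0 ... */

    /* ... while the ib_mthca interrupts stay on CPU 1, as in the
     * "-T 0,0" runs quoted below. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    printf("pid %d pinned to CPU 0; run the send/recv loop from here\n",
           (int)getpid());
    return 0;
}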
Regards,
felix

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:general-
> [EMAIL PROTECTED] On Behalf Of Craig Prescott
> Sent: Wednesday, February 13, 2008 9:32 PM
> To: Scott Weitzenkamp (sweitzen)
> Cc: [email protected]; [EMAIL PROTECTED]
> Subject: Re: [ofa-general] SDP performance with bzcopy testing help
> needed
>
> Scott Weitzenkamp (sweitzen) wrote:
> >> But the effect is still clear.
> >>
> >> throughput:
> >>
> >>            64K      128K     1M
> >> SDP      7602.40  7560.57  5791.56
> >> BZCOPY   5454.20  6378.48  7316.28
> >>
> >
> > Looks unclear to me. Sometimes BZCOPY does better, sometimes worse.
> >
> Fair enough.
>
> While measuring a broader spectrum of message sizes, I noted a
> big variation in throughput and send service demand for the SDP
> case as a function of which core/CPU the netperf ran on.
> Particularly, which CPU the netperf ran on relative to which
> CPU was handling the interrupts for ib_mthca.
>
> Netperf has an option (-T) to allow for local and remote cpu
> binding. So I used it to force the client and server to run on
> CPU 0. Further, I mapped all ib_mthca interrupts to CPU 1 (irqbalance
> was already disabled). This appears to have reduced the statistical
> error between netperf runs to negligible amounts. I'll do more runs
> to verify this and check out the other permutations, but this is what
> has come out so far.
>
> TPUT = throughput (Mbits/sec)
> LCL  = send service demand (usec/KB)
> RMT  = recv service demand (usec/KB)
>
> "-T 0,0" option given to netperf client:
>
>                    SDP                   BZCOPY
>           --------------------   --------------------
> MESGSIZE    TPUT    LCL    RMT     TPUT    LCL    RMT
> --------  -------  -----  -----  -------  -----  -----
>      64K  7581.14  0.746  1.105  5547.66  1.491  1.495
>     128K  7478.37  0.871  1.116  6429.84  1.282  1.291
>     256K  7427.38  0.946  1.115  6917.20  1.197  1.201
>     512K  7310.14  1.122  1.129  7229.13  1.145  1.150
>       1M  7251.29  1.143  1.129  7457.95  0.996  1.109
>       2M  7249.27  1.146  1.133  7340.26  0.502  1.105
>       4M  7217.26  1.156  1.136  7322.63  0.397  1.096
>
> In this case, BZCOPY send service demand is significantly
> less for the largest message sizes, though the throughput
> for large messages is not very different.
>
> However, with "-T 2,2", the result looks like this:
>
>                    SDP                   BZCOPY
>           --------------------   --------------------
> MESGSIZE    TPUT    LCL    RMT     TPUT    LCL    RMT
> --------  -------  -----  -----  -------  -----  -----
>      64K  7599.40  0.841  1.114  5493.56  1.510  1.585
>     128K  7556.53  1.039  1.121  6483.12  1.274  1.325
>     256K  7155.13  1.128  1.180  6996.30  1.180  1.220
>     512K  5984.26  1.357  1.277  7285.86  1.130  1.166
>       1M  5641.28  1.443  1.343  7250.43  0.811  1.141
>       2M  5657.98  1.439  1.387  7265.85  0.492  1.127
>       4M  5623.94  1.447  1.370  7274.43  0.385  1.112
>
> For BZCOPY, the results are pretty similar; but for SDP,
> the service demands are much higher, and the throughputs
> have dropped dramatically relative to "-T 0,0".
>
> In either case, though, BZCOPY is more efficient for
> large messages.
>
> Cheers,
> Craig
