Re: [ofa-general] Support for CXGB3 RNIC on P6
Sorry for the vagueness. There are no messages in /var/log (or the console/dmesg). To answer Jon's question: lspci doesn't show the device. No lights come up on the adapter, which I guess is the normal behavior until the device is successfully probed and recognized. thanks, - KK

Roland Dreier rdre...@cisco.com wrote on 02/03/2009 09:42:39 PM:

My colleague (at a different site) is trying to get a couple of Chelsio RNIC adapters working on p6 systems but for some reason the cards aren't recognized on bootup.

What do you mean by "aren't recognized on bootup"? More details on the specific problem are needed to diagnose it. - R.

___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Support for CXGB3 RNIC on P6
Hi, My colleague (at a different site) is trying to get a couple of Chelsio RNIC adapters working on p6 systems, but for some reason the cards aren't recognized on bootup. The same cards work on my xSeries systems, and the following are the messages I get (there are no messages on his p6 systems):

Feb 1 11:42:49 localhost kernel: Chelsio T3 Network Driver - version 1.1.1-ko
Feb 1 11:42:49 localhost kernel: cxgb3 :22:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
Feb 1 11:42:49 localhost kernel: input: Power Button (FF) as /class/input/input1
Feb 1 11:42:49 localhost kernel: ACPI: Power Button (FF) [PWRF]
Feb 1 11:42:49 localhost kernel: cxgb3 :22:00.0: Port 0 using 4 queue sets.
Feb 1 11:42:49 localhost kernel: eth2: Chelsio T310 10GBASE-R RNIC (rev 4) PCI Express x8 MSI-X
Feb 1 11:42:49 localhost kernel: eth2: 128MB CM, 256MB PMTX, 256MB PMRX, S/N: PT49070050

Is this revision of cxgb3 (rev 4) not supported on p6? Or are we missing something to get it to work? thanks, - KK
Re: [ofa-general] Test programs supporting RNIC's.
Hi Dotan,

ib_clock_test ib_read_lat ib_write_bw_postlist ib_rdma_bw ib_send_bw ib_write_lat ib_rdma_lat ib_send_lat ib_read_bw

Of these, only ib_rdma_bw seems to be CMA enabled. Is this the only program that supports RNIC's?

Yes. I know that there are plans to add CMA support to all of the ib_* applications as well (but today, only ib_rdma_bw and ib_rdma_lat support it).

Yes, I had forgotten ib_rdma_lat on the list of CMA-enabled apps. But somehow it didn't work for me. I need to reboot to the distro OS and locate the error; I will post it later. Thanks, - KK
Re: [ofa-general] Test programs supporting RNIC's.
Hi Steve, Steve Wise [EMAIL PROTECTED] wrote on 04/09/2008 07:33:35 PM: Krishna, if you are interested, you could add cma support to the rest of these. I can help by answering questions and/or testing things... If no one else is already doing this, I can start doing this in the background. Will follow up with you if I need any help. Thanks, - KK
[ofa-general] Test programs supporting RNIC's.
Hi, I am testing Chelsio cxgb3 RNICs on RHEL5.2 (beta). The following list of applications is installed on the system as part of the OFED install option: ib_clock_test ib_read_lat ib_write_bw_postlist ib_rdma_bw ib_send_bw ib_write_lat ib_rdma_lat ib_send_lat ib_read_bw. Of these, only ib_rdma_bw seems to be CMA enabled. Is this the only program that supports RNIC's? Thanks, - KK
[ofa-general] Re: Status of NFS-RDMA ?
Hi James,

Since you had mentioned in an earlier email that the NFS-RDMA server side will be present in OFED1.4, do you know if any port of the server code to OFED1.3 (when it comes out) will happen? Is there any effort for that, any work ongoing, any help required, etc? I couldn't find the release timelines for OFED1.4; is there any link on the openfabrics homepage? Thanks, - KK

[EMAIL PROTECTED] wrote on 01/29/2008 08:23:46 PM: On Tue, 29 Jan 2008, Pawel Dziekonski wrote: On Mon, 28 Jan 2008 at 10:14:22AM -0500, James Lentini wrote: On Sat, 26 Jan 2008, Pawel Dziekonski wrote:

I pulled Tom's tree from the new url and built a kernel.

If you enabled support for INFINIBAND drivers (IB and iWARP support) and NFS client/server support, the kernel should be ready to go (run grep RDMA /your_kernel_sources/.config to confirm that CONFIG_SUNRPC_XPRT_RDMA is either m or y). NFS/RDMA doesn't require OFED to be installed. OFED is a release of the Linux kernel sources and some userspace libraries/tools.

If you are then I downloaded OFED from http://www.mellanox.com/downloads/NFSoRDMA/OFED-1.2-NFS-RDMA.gz,

I don't know what the above URL contains. The latest code is in Tom Tucker's tree (and now NFS server maintainer Bruce Fields' tree). It is

hi, back to the subject on a proper mailing list. I have 3 years of experience with mellanox hardware and IBGold, so I basically know what OFED is all about. up to now i was only using IBGold since IB drivers appeared in the kernel pretty recently.

You'll want to use the mainline kernel's IB drivers for NFS/RDMA. We've been developing the NFS/RDMA software on the OpenFabrics (aka OpenIB) code since it was merged into 2.6.10 in Dec 2004.

currently I have new hardware. I'm running Tom's kernel and already did some MPI tests. SDP is not working, probably because the sdp kernel modules were not built. ;) I understand that those modules are only available from ofa-kernel. please correct me if i'm wrong.

Correct. SDP has never been submitted to mainline Linux.
system is Scientific Linux 4.5, which is supposed to be a fully compatible RH4 clone. hardware is Supermicro mobos with Mellanox MT25204 and a Flextronics switch. error log from ofa-kernel build:

Is your goal to build a kernel with an NFS/RDMA server? If so, the kernel sources from Tom Tucker's git tree are the ones you want, not the old OFED 1.2-based packages, which are out of date. Did you try setting up the NFS/RDMA server on the kernel used for your MPI tests above?

make[1]: Entering directory `/usr/src/ib/xprt-switch-2.6'
test -e include/linux/autoconf.h -a -e include/config/auto.conf || ( \
echo; \
echo "ERROR: Kernel configuration is invalid."; \
echo "include/linux/autoconf.h or include/config/auto.conf are missing."; \
echo "Run 'make oldconfig && make prepare' on kernel src to fix it."; \
echo; \
/bin/false)

obviously, doing 'make oldconfig && make prepare' does not help. anyway, the above-mentioned files do exist:

# ls -la /usr/src/ib/xprt-switch-2.6/{include/linux/autoconf.h,include/config/auto.conf}
-rw-r--r-- 1 root root 10156 Jan 25 17:42 /usr/src/ib/xprt-switch-2.6/include/config/auto.conf
-rw-r--r-- 1 root root 14733 Jan 25 17:42 /usr/src/ib/xprt-switch-2.6/include/linux/autoconf.h

despite the above, compilation continues but fails with:

gcc -Wp,-MD,/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/core/.mad.o.d -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/3.4.6/include -D__KERNEL__ -I/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/include -I/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/include -Iinclude -include include/linux/autoconf.h -include /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/include/linux/autoconf.h -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Os -m64 -mno-red-zone -mcmodel=kernel -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -maccumulate-outgoing-args -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -fomit-frame-pointer -Wdeclaration-after-statement -DMODULE -DKBUILD_STR(s)=#s -DKBUILD_BASENAME=KBUILD_STR(mad) -DKBUILD_MODNAME=KBUILD_STR(ib_mad) -c -o /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/core/.tmp_mad.o /var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/core/mad.c
/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/core/mad.c: In function `ib_mad_init_module':
/var/tmp/OFEDRPM/BUILD/ofa_kernel-1.2/drivers/infiniband/core/mad.c:2966: error: too many arguments to function `kmem_cache_create'
make[4]: ***
[ofa-general] Re: Status of NFS-RDMA ?
Hi Jeff, James,

Great. If you let me know when the bits are ready (I don't always read the mailing list), I should be able to get some testing done. Thanks, - KK

Jeff Becker [EMAIL PROTECTED] wrote on 01/30/2008 11:02:09 PM: Hi all. James Lentini wrote: On Wed, 30 Jan 2008, Krishna Kumar2 wrote: Hi James, Since you had mentioned in an earlier email that the NFS-RDMA server side will be present in OFED1.4, Actually, that was Tziporet. do you know if any port of the server code to OFED1.3 (when it comes out) will happen? Is there any effort for that, any work ongoing, any help required, etc? Jeff Becker had looked into this. We would definitely appreciate the help. I have set up a git tree for NFSoRDMA and successfully merged it with, and built it on, OFED 1.3-rcx. I'm currently doing the backports (SLES 10 SP1 first). All this is in preparation for OFED 1.4, as that is when NFSoRDMA will be included in OFED. I think I have this patching/backporting stuff under control. However, my testing resources are limited. Thus, depending on your platform, I might be able to point you at OFED 1.3-based bits for testing if/when they are ready. Thanks. -jeff The NFS framework has changed significantly in several areas in recent kernel releases. This has made backporting the NFS/RDMA code to older kernels challenging. If you are interested in working on OFED1.3 support, let us know. I couldn't find the release timelines for OFED1.4; is there any link on the openfabrics homepage? I'm not involved with the OFED1.4 planning. Tziporet, is there information on this? Thanks, - KK
Re: [ofa-general] [PATCH] ipoib: Bug fix in ipoib_poll - resend
Hi Eli,

t = min(IPOIB_NUM_WC, max);
n = ib_poll_cq(priv->cq, t, priv->ibwc);
+ if (unlikely(n < 1))
+         break;
for (i = 0; i < n; i++) {
        struct ib_wc *wc = priv->ibwc + i;

The 'for' loop (followed by the "if (n != t) break" check) should take care of this, shouldn't it? Thanks, - KK
Re: [ofa-general] Re: [PATCH] IPoIB: Remove redundant check in xmit handler
Hi Roland,

I am not sure if my answer was clear, so I will try to be clearer: qdisc_run() first checks netif_queue_stopped(dev), and then, if it can get the __LINK_STATE_QDISC_RUNNING bit, it calls __qdisc_run(), which does the actual xmit. Subsequent calls to xmit within __qdisc_run check netif_queue_stopped. So there is no way that xmit can be called with a stopped queue, as the core does this check for every skb. And no other cpu can execute this at the same time while the RUNNING bit is held. So removing the per-skb check is completely safe. I have tested this code extensively as part of batching skbs and have never hit it. Thanks, - KK

Hi Roland,

This check was added because of a real problem seen in practice a while ago. Has something changed in the tx queue locking that makes it redundant now?

I am not sure how it was earlier, but currently a device's xmit can be called only on one cpu at a time (by holding the __LINK_STATE_QDISC_RUNNING bit in qdisc_run). And the queue_stopped check is done before xmit.

I seem to remember that I could make the problem race trigger pretty fast by making the tx queue very small so that it got stopped a lot.

I just tested with a smaller queue size (tx queue size=4), put a debug in the queue_stopped check in xmit(), and a counter to find how many times the queue was stopped (in ipoib_send). After a 20 min test run with 64 threads, the queue was stopped 16.5 million times, but the debug never hit. I tested with buffer sizes varying from 128 to 16K bytes (though TSO/GSO is not implemented in IPoIB anyway). Thanks, - KK
Re: [ofa-general] Re: [PATCH] IPoIB: Remove redundant check in xmit handler
Hi Roland, Roland Dreier [EMAIL PROTECTED] wrote on 11/20/2007 09:55:00 AM:

qdisc_run() first checks netif_queue_stopped(dev), and then, if it can get the __LINK_STATE_QDISC_RUNNING bit, it calls __qdisc_run(), which does the actual xmit. Subsequent calls to xmit within __qdisc_run check netif_queue_stopped. So there is no way that xmit can be called with a stopped queue, as the core does this check for every skb. And no other cpu can execute this at the same time while the RUNNING bit is held. So removing the per-skb check is completely safe.

Hmm, I don't see any changes that prevent the race I originally described in http://oss.sgi.com/archives/netdev/2004-12/msg00474.html Maybe your test may just not be able to hit the race, or am I missing something?

(Thanks for the link, I hadn't seen this earlier.) That race will not happen, as CPU#2 cannot call qdisc_restart when CPU#1 is holding the RUNNING bit. In this case, CPU#2 simply queues the skb to dev->q and returns, while CPU#1 finds this new skb in its iteration of __qdisc_run (and ends up processing all queued skbs after checking for a stopped queue).

Anyway medium-term I want to move IPoIB away from LLTX so this doesn't matter that much.

Are you planning this for 2.6.25? Thanks, - KK
Re: [ofa-general] Re: [PATCH] IPoIB: Remove redundant check in xmit handler
I forgot to mention: maybe the RUNNING bit was added as part of this bug fix, which would mean no LLTX driver needs to check for this condition again in its xmit handler. That might explain the problem you faced then. thanks, - KK
Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
Hi Dave, David Miller wrote on 10/10/2007 02:13:31 AM:

Hopefully that new qdisc will just use the TX rings of the hardware directly. They are typically large enough these days. That might avoid some locking in this critical path.

Indeed, I also realized last night that for the default qdiscs we do a lot of stupid useless work. If the queue is a FIFO and the device can take packets, we should send the packet directly and not stick it into the qdisc at all.

Since you are talking of how it should be done in the *current* code, I feel LLTX drivers will not work nicely with this. Actually, I was trying this change a couple of weeks back, but felt that doing so would result in out-of-order packets (skbs present in the queue that were not sent out due to an LLTX failure would be sent out only at the next net_tx_action, while other skbs are sent ahead of them). One option is to first call qdisc_run() and then process this skb, but that is ugly (requeue handling). However, I guess this can be done cleanly once LLTX is removed. Thanks, - KK
[ofa-general] Re: [PATCHES] TX batching
J Hadi Salim [EMAIL PROTECTED] wrote on 10/08/2007 07:35:20 PM:

I don't see something from Krishna's approach that I can take and reuse. This may be because my old approaches have evolved from the same path. There is a long list, but as a sample: I used to do a lot more work while holding the queue lock, which I have now moved post queue lock; I don't have any special interfaces/tricks just for batching; I provide hints to the core of how much the driver can take, etc. I have offered Krishna co-authorship if he makes the IPOIB driver work on my patches; that offer still stands if he chooses to take it.

My feeling is that since the approaches are very different, it would be a good idea to test the two for performance. Do you mind me doing that? Of course, others and/or you are more than welcome to do the same. I had sent a note to you yesterday about this, please let me know either way.

*** Previous mail ***

Hi Jamal, If you don't mind, I am trying to run your approach vs mine to get some results for comparison. For starters, I am having issues with iperf when using your infrastructure code with my IPoIB driver - about 100MB is sent and then everything stops for some reason. The changes in the IPoIB driver that I made to support batching are to set BTX, set xmit_win, and dynamically reduce xmit_win on every xmit and increase xmit_win on every xmit completion. Is there anything else that is required from the driver? thanks, - KK
[ofa-general] RE: [PATCH 2/3][NET_BATCH] net core use batching
Hi Peter, Waskiewicz Jr, Peter P [EMAIL PROTECTED] wrote on 10/09/2007 04:03:42 AM:

true, that needs some resolution. Here's a hand-waving thought: Assuming all packets of a specific map end up in the same qdisc queue, it seems feasible to ask the qdisc scheduler to give us enough packages (I've seen people use that term to refer to packets) for each hardware ring's available space. With the patches I posted, I do that via dev->xmit_win, which assumes only one view of the driver; essentially a single ring. If that is doable, then it is up to the driver to say "I have space for 5 in ring[0], 10 in ring[1], 0 in ring[2]" based on what scheduling scheme the driver implements - the dev->blist can stay the same. It's a handwave, so there may be issues there, and there could be better ways to handle this. Note: the other issue that needs resolving, which I raised earlier, was in regards to multiqueue running on multiple cpus servicing different rings concurrently.

I can see the qdisc being modified to send batches per queue_mapping. This shouldn't be too difficult, especially if we had the xmit_win per queue (in the subqueue struct, like Dave pointed out).

I hope my understanding of multiqueue is correct for this mail to make sense :-)

Isn't it enough that the multiqueue+batching drivers handle skbs belonging to different queues themselves, instead of the qdisc having to figure that out? This will reduce costs for most skbs that are neither batched nor sent to multiqueue devices. E.g., the driver can keep processing skbs and put them on the correct tx_queue as long as the mapping remains the same. If the mapping changes, it posts the earlier skbs (with the correct lock) and then iterates over the other skbs that have the next different mapping, and so on.
(This is required only if the driver is supposed to transmit 1 skb per call; otherwise it is not an issue.) Alternatively, supporting drivers could return a different code on a mapping change, like NETDEV_TX_MAPPING_CHANGED (for batching only), so that qdisc_run() could retry. Would that work? Secondly, regarding having xmit_win per queue: would it help in the multiple-skb case? Currently there is no way to tell the qdisc to dequeue skbs from a particular band - it returns the skb from the highest-priority band. thanks, - KK
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
Hi Dave, David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM:

Isn't it enough that the multiqueue+batching drivers handle skbs belonging to different queues themselves, instead of the qdisc having to figure that out? This will reduce costs for most skbs that are neither batched nor sent to multiqueue devices. E.g., the driver can keep processing skbs and put them on the correct tx_queue as long as the mapping remains the same. If the mapping changes, it posts the earlier skbs (with the correct lock) and then iterates over the other skbs that have the next different mapping, and so on.

The complexity in most of these suggestions is beginning to drive me a bit crazy :-) This should be the simplest thing in the world: when the TX queue has space, give it packets. Period. When I hear suggestions like "have the driver pick the queue in ->hard_start_xmit() and return some special status if the queue becomes different", you know, I really begin to wonder :-) If we have to go back, get into the queueing layer locks, have these special cases, and whatnot, what's the point?

I understand your point, but the qdisc code itself needs almost no change, as small as:

qdisc_restart()
{
	...
	case NETDEV_TX_MAPPING_CHANGED:
		/*
		 * Driver sent some skbs from one mapping, and found others
		 * are for a different queue_mapping. Try again.
		 */
		ret = 1;	/* guaranteed to have at least 1 skb in batch list */
		break;
	...
}

Alternatively, if the driver does all the dirty work, the qdisc needs no change at all. However, I am not sure if this addresses all the concerns raised by you, Peter, Jamal, and others.

This code should eventually be able to run lockless all the way to the TX queue handling code of the driver. The queueing code should know what TX queue the packet will be bound for, and always precisely invoke the driver in a state where the driver can accept the packet.
This sounds like a good idea :) I need to think more on this, especially as my batching sends multiple skbs of possibly different mappings to the device, and those skbs stay in the batch list if the driver couldn't send them out. thanks, - KK
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM: Ignore LLTX, it sucks, it was a big mistake, and we will get rid of it. Great, this will make life easy. Any idea how long that would take? It seems simple enough to do. thanks, - KK
[ofa-general] Re: [PATCH 1/3] [NET_BATCH] Introduce batching interface
Hi Jamal, If you don't mind, I am trying to run your approach vs mine to get some results for comparison. For starters, I am having issues with iperf when using your infrastructure code with my IPoIB driver - about 100MB is sent and then everything stops for some reason. The changes in the IPoIB driver that I made to support batching is to set BTX, set xmit_win, and dynamically reduce xmit_win on every xmit and increase xmit_win on every xmit completion. Is there anything else that is required from the driver? thanks, - KK J Hadi Salim [EMAIL PROTECTED] wrote on 10/08/2007 12:06:23 AM: This patch introduces the netdevice interface for batching. cheers, jamal [attachment oct07-p1of3 deleted by Krishna Kumar2/India/IBM]
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
J Hadi Salim [EMAIL PROTECTED] wrote on 10/08/2007 06:47:24 PM:

two, there should _never_ be any requeueing, even with LLTX, in the previous patches when I supported them; if there is, it is a bug. This is because we don't send more than what the driver asked for via xmit_win. So if it asked for more than it can handle, that is a bug. If its available space changes while we are sending to it, that too is a bug.

The driver might ask for 10 and we send 10, but an LLTX driver might fail to get the lock and return TX_LOCKED. I haven't seen your code in great detail, but don't you requeue in that case too? - KK
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
jamal wrote:

+ while ((skb = __skb_dequeue(skbs)) != NULL)
+         q->ops->requeue(skb, q);

->requeue queues at the head, so this looks like it would reverse the order of the skbs.

Excellent catch! thanks; i will fix. As a side note: Any batching driver should _never_ have to requeue; if it does it is buggy. And the non-batching ones, if they ever requeue, will requeue a single packet, so not much reordering.

On the contrary, batching LLTX drivers (if that is not ruled out) will very often requeue, resulting in heavy reordering. Fix looks good though. - KK
[ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000
Hi Or,

Sorry about the delay; I ran into various bugs and then the system froze for about 2 days. Or Gerlitz [EMAIL PROTECTED] wrote on 09/17/2007 03:22:58 PM:

good, please test with rev5 and let us know.

I tested with rev5 and this is what I found (different from what I said earlier about EHCA): the original code had almost zero retransmissions (for a particular run), 1 for EHCA and 0 for MTHCA. With batching, both had high retransmissions: 73680 for EHCA and 70268 for MTHCA. It seems I was wrong when I said EHCA was having no issues. So far I have identical retransmission numbers for E1000 only.

transmission of 4K batched packets sounds like a real problem for the receiver side; with a 0.5K send/recv queue size, it's 8 batches of 512 packets each, where for each RX there is a completion (WC) to process and an SKB to alloc and post to the QP, while for the TX there's only posting to the QP, processing one (?) WC, and freeing 512 SKBs.

The receiver and sender both have 4K WRs. I had earlier changed batching so that IPoIB will send at most 2 skbs even if more are present in the queue, and send 2 more after the first two, and so on. But that too gave high retransmission numbers.

If indeed the situation is so asymmetrical, I am starting to think that the CPU utilization at the sender side might be much higher with batching than without batching; have you looked into that?

Overall it is almost the same. I had used netperf (about 1 month back) and it gave almost the same numbers. I haven't tried recently. Even in the regular code, though batching is not done, qdisc_restart() does xmit in a tight loop. The only difference is that dev->queue_lock is dropped/reacquired for each skb, and dev->tx_lock is held for shorter times. I avoid the former and have no control over the latter.

I am not with you.
Looking at 2.6.22 and 2.6.23-rc5, for both the ipoib-NAPI mechanism is implemented through the function ipoib_poll being the polling API for the network stack etc, so what is the old API and where does this difference exist? I meant the pre-Stephen-Hemminger converted NAPI. He had changed the old NAPI to the newer one (where the driver doesn't get *budget, etc). You might want to try something lighter such as an iperf UDP test, where a nice criterion would be to compare bandwidth AND packet loss between no-batching and batching. As for the MTU, the default is indeed 2K (2044) but it's always good to just know the facts, namely what was the mtu during the test. OK, that is a good idea. I will try it over the weekend. if you have user space libraries installed, load ib_uverbs and run the command ibv_devinfo, you will see all the infiniband devices on your system and for each its device id and firmware version. If not, you should be looking at /sys/class/infiniband/$device/hca_type and /sys/class/infiniband/$device/fw_ver Both these files are not present, though ehca0 is present. For mthca, the values are: MT23108 3.5.0. Thanks, - KK
[ofa-general] Re: [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule()
Hi Jan-Bernd, Jan-Bernd Themann [EMAIL PROTECTED] wrote on 09/19/2007 06:53:48 PM: If I understood it right, the problem you describe (quota update in __napi_schedule) can cause further problems when you choose the following numbers: CPU1: A. process 99 pkts CPU1: B. netif_rx_complete() CPU2: interrupt occurs, netif_rx_schedule is called, net_rx_action triggered: CPU2: C. set quota = 100 (__napi_schedule) CPU2: D. call poll(), process 1 pkt CPU2: D.2 call netif_rx_complete() (quota not exceeded) CPU2: E. net_rx_action: set quota=99 CPU1: F. net_rx_action: set quota=99 - 99 = 0 CPU1: G. modify list (list_move_tail) although netif_rx_complete has been called Step G would fail as the device is not in the list due to netif_rx_complete. This case can occur for all devices running on an SMP system where interrupts are not pinned. I think list_move should be OK whether the device is on the list or not. But it should not come to that code since work (99) != weight (100). If work == weight, then the driver would not have done complete, and the next/prev would not be set to POISON. I like the clean changes made by Dave to fix this, and will test it today (if I can get my crashed system to come up). Thanks, - KK
[ofa-general] Re: [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule()
Hi Dave, David Miller [EMAIL PROTECTED] wrote on 09/19/2007 09:35:57 PM: The NAPI_STATE_SCHED flag bit should provide all of the necessary synchronization. Only the setter of that bit should add the NAPI instance to the polling list. The polling loop runs atomically on the cpu where the NAPI instance got added to the per-cpu polling list. And therefore decisions to complete NAPI are serialized too. That serialized completion decision is also when the list deletion occurs. About the list deletion occurs, isn't the race I mentioned still present? If done < budget, the driver does netif_rx_complete (at which time some other cpu can add this NAPI to their list). But the first cpu might do some more actions on the napi, like ipoib_poll() calling ib_req_notify_cq(priv->cq), when the other cpu might have started using this napi. (net_rx_action's 'list_move' however will not execute since work != weight) Thanks, - KK
[ofa-general] Re: [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule()
Ran 4/16/64 thread iperf on latest bits with this patch and no issues after 30 mins. I used to consistently get the bug within 1-2 mins with just 4 threads prior to this patch. Tested-by: Krishna Kumar [EMAIL PROTECTED] (if any value in that) thanks, - KK David Miller [EMAIL PROTECTED] wrote on 09/20/2007 10:42:24 AM: From: Krishna Kumar2 [EMAIL PROTECTED] Date: Thu, 20 Sep 2007 10:40:33 +0530 I like the clean changes made by Dave to fix this, and will test it today (if I can get my crashed system to come up). I would very much appreciate this testing, as I'm rather sure we've plugged up the most serious holes at this point.
[ofa-general] Re: [PATCH 1/2] IPoIB: Fix unregister_netdev hang
Hi Roland, Roland Dreier [EMAIL PROTECTED] wrote on 09/18/2007 07:57:24 PM: While using IPoIB over EHCA (rc6 bits), unregister_netdev hangs with I don't think you're actually using rc6 bits, since in your patch you have: -poll_more: and I think that is only in Dave's net-2.6.24 tree now, right? Nope, that was what I downloaded yesterday: VERSION = 2 PATCHLEVEL = 6 SUBLEVEL = 23 EXTRAVERSION = -rc6 NAME = Pink Farting Weasel +if (likely(!ib_req_notify_cq(priv->cq, + IB_CQ_NEXT_COMP | + IB_CQ_REPORT_MISSED_EVENTS))) It is possible for an interrupt to happen immediately right here, before the netif_rx_complete(), so that netif_rx_schedule() gets called while we are still on the poll list. +netif_rx_complete(dev, napi); To be clear, netif_rx_schedule while we are still in the poll list will not do any harm as it does nothing since NAPI_STATE_SCHED is still set (cleared by netif_rx_complete which has not yet run). Effectively we lost/delayed processing an interrupt, if I understood the code right. I agree with you on the new patch. thanks, - KK
[ofa-general] Re: [PATCH 1/2] IPoIB: Fix unregister_netdev hang
Hi Roland, Please double check your tree. I just very carefully looked at my trees, and the poll_more: label is added in commit 6b460a71 ([NET]: Make NAPI polling independent of struct net_device objects.) which is only in the net-2.6.24 tree. Of course Dave did not change the version information in the Makefile since he wouldn't want Linus to pick up any extra strange changes when he pulls, so a net-2.6.24 tree will look like 2.6.23-rc6 as you quoted. And the refcounting bug I fixed is only in net-2.6.24. You are absolutely right. My wording was incorrect, I should have said net-2.6.24 (which is *at* rev rc6). To be clear, netif_rx_schedule while we are still in the poll list will not do any harm as it does nothing since NAPI_STATE_SCHED is still set (cleared by netif_rx_complete which has not yet run). Effectively we lost/delayed processing an interrupt, if I understood the code right. Right, we lose an interrupt, and since the CQ events are one-shot, we never get another one, and the interface is effectively dead. Thanks, - KK
[ofa-general] Re: [PATCH 3/10 REV5] [sched] Modify qdisc_run to support batching
Hi Evgeniy, Evgeniy Polyakov [EMAIL PROTECTED] wrote on 09/14/2007 05:45:19 PM: + if (skb->next) { + int count = 0; + + do { + struct sk_buff *nskb = skb->next; + + skb->next = nskb->next; + __skb_queue_tail(dev->skb_blist, nskb); + count++; + } while (skb->next); Could it be a list_move()-like function for skb lists? I'm pretty sure if you change the first and the last skbs and the length of the queue in one shot, the result will be the same. I have to do a bit more, like update count, etc, but I agree it is do-able. I had mentioned in my PATCH 0/10 that I will later try this suggestion that you provided last time. Actually how many skbs are usually batched in your load? It depends, e.g. when the tx lock is not got, I get batching of up to 8-10 skbs (assuming that the tx lock was not got quite a few times). But when the queue gets blocked, I have seen batching of up to 4K skbs (if tx_queue_len is 4K). + /* Reset destructor for kfree_skb to work */ + skb->destructor = DEV_GSO_CB(skb)->destructor; + kfree_skb(skb); Why do you free the first skb in the chain? This is the gso code which has segmented 'skb' into skb'1-n', and those skb'1-n' are sent out and freed by the driver, which means the dummy 'skb' (without any data) remains to be freed. Thanks, - KK
[ofa-general] Re: [PATCH 2/10 REV5] [core] Add skb_blist support for batching
Hi Evgeniy, Evgeniy Polyakov [EMAIL PROTECTED] wrote on 09/14/2007 06:16:38 PM: + if (dev->features & NETIF_F_BATCH_SKBS) { + /* Driver supports batching skb */ + dev->skb_blist = kmalloc(sizeof *dev->skb_blist, GFP_KERNEL); + if (dev->skb_blist) + skb_queue_head_init(dev->skb_blist); + } + A nitpick is that you should use sizeof(struct ...) and I think it requires flag clearing in case of failed initialization? I thought it is better to use *var_name in case the name of the structure changes. Also, the flag is not cleared since I could try to enable batching later, and it could succeed at that time. When skb_blist is allocated, then batching is enabled, otherwise it is disabled (while the features flag just indicates that the driver supports batching). Thanks, - KK
[ofa-general] Re: [PATCH 10/10 REV5] [E1000] Implement batching
Hi Evgeniy, Evgeniy Polyakov [EMAIL PROTECTED] wrote on 09/14/2007 06:17:14 PM: if (unlikely(skb->len <= 0)) { dev_kfree_skb_any(skb); - return NETDEV_TX_OK; + return NETDEV_TX_DROPPED; } This change could actually go as its own patch, although not sure it is ever used. just a thought, not a stopper. Since this flag is new and useful only for batching, I feel it is OK to include it in this patch. + if (!skb || (blist && skb_queue_len(blist))) { + /* + * Either batching xmit call, or single skb case but there are + * skbs already in the batch list from previous failure to + * xmit - send the earlier skbs first to avoid out of order. + */ + if (skb) + __skb_queue_tail(blist, skb); + skb = __skb_dequeue(blist); Why is it put at the end? There is a bug that I had explained in rev4 (see XXX below) resulting in sending out skbs out of order. The fix is that if the driver gets called with a skb but there are older skbs already in the batch list (which failed to get sent out), send those skbs first before this one. Thanks, - KK [XXX] Dave had suggested to use batching only in the net_tx_action case. When I implemented that in earlier revisions, there were lots of TCP retransmissions (about 18,000 to every 1 in regular code). I found the reason for part of that problem as: skbs get queued up in dev->qdisc (when the tx lock was not got or the queue was blocked); when net_tx_action is called later, it passes the batch list as an argument to qdisc_run and this results in skbs being moved to the batch list; then the batching xmit also fails due to tx lock failure; the next many regular xmits of a single skb will go through the fast path (pass NULL batch list to qdisc_run) and send those skbs out to the device while the previous skbs are cooling their heels in the batch list. The first fix was to not pass NULL/batch-list to qdisc_run() but to always check whether skbs are present in the batch list when trying to xmit. 
This reduced retransmissions by a third (from 18,000 to around 12,000), but led to another problem while testing - iperf transmits almost zero data for a higher # of parallel flows like 64 or more (and when I run iperf for a 2 min run, it takes about 5-6 mins, and reports that it ran 0 secs and the amount of data transferred is a few MB's). I don't know why this happens with this being the only change (any ideas are very appreciated). The second fix that resolved this was to revert back to Dave's suggestion to use batching only in the net_tx_action case, and modify the driver to see if skbs are present in the batch list and to send them out first before sending the current skb. I still see huge retransmissions for IPoIB (but not for E1000), though it has reduced to 12,000 from the earlier 18,000 number.
[ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000
Hi Dave, David Miller [EMAIL PROTECTED] wrote on 09/17/2007 04:47:48 AM: The only major complaint I have about this patch series is that the IPoIB part should just be one big changeset. Otherwise the tree is not bisectable, for example the initial ipoib header file change breaks the build. Right, I will change it accordingly. On a lower priority, I question the indirection of skb_blist by making it a pointer. For what? Saving 12 bytes on 64-bit? That kmalloc()'d thing is a nearly guaranteed cache and/or TLB miss. Just inline the thing, we generally don't do crap like this anywhere else. The intention was to avoid having two flags (one that the driver supports batching and a second to indicate that batching is on/off). So I could test skb_blist as an indication of whether batching is on/off. But your point on cache miss is absolutely correct, and I will change this part to be inline. thanks, - KK
[ofa-general] Re: [PATCH 1/10 REV5] [Doc] HOWTO Documentation for batching
Hi Randy, Randy Dunlap [EMAIL PROTECTED] wrote on 09/15/2007 12:07:09 AM: + To fix this problem, error cases where driver xmit gets called with a + skb must code as follows: + 1. If driver xmit cannot get tx lock, return NETDEV_TX_LOCKED + as usual. This allows qdisc to requeue the skb. + 2. If driver xmit got the lock but failed to send the skb, it + should return NETDEV_TX_BUSY but before that it should have + queue'd the skb to the batch list. In this case, the qdisc queued + does not requeue the skb. Since this was a new section that I added to the documentation, this error crept in. Thanks for catching it, and for the review comments/ack-off :) thanks, - KK
[ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000
Hi Or, So with ipoib/mthca you still see this 1 : 18.5K retransmission rate (with no noticeable retransmission increase for E1000) you were reporting at the V4 post?! if this is the case, I think it calls for further examination, where help from Mellanox could ease things, I guess. What I will do today/tomorrow is to run rev5 (which I didn't run for mthca) on both ehca and mthca, get statistics and send them out. Otherwise what you stated is correct as far as rev4 goes. After giving the latest details, I will appreciate any help from Mellanox developers. By saying that with ehca you see normal level retransmissions - 2 times the regular code do you mean 1 : 2 retransmission rate between batching to no batching? Correct, for every 1 retransmission in the regular code, I see two retransmissions in the batching case (which I assume is due to overflow at the receiver side as I batch sometimes up to 4K skbs). I will post the exact numbers in the next post. I am not sure this was mentioned over the threads, but clearly two sides are needed for the dance here, namely I think you want to do your tests (both no batching and with batching) with something like NAPI enabled at the receiver side, 2.6.23-rc5 has NAPI I was using 2.6.23-rc1 on the receiver (which also has NAPI, but uses the old API - the same fn ipoib_poll()). is this with no delay set or not? connected or datagram mode? mtu? netperf command? system spec (specifically hca device id and fw version), etc? This is TCP (without No Delay), datagram mode, I didn't change the mtu from the default (is it 2K?). Command is iperf with various options for different test buffer-size/threads. 
Regarding id/etc, this is what dmesg has: Sep 16 22:49:26 elm3b39 kernel: eHCA Infiniband Device Driver (Rel.: SVNEHCA_0023) Sep 16 22:49:26 elm3b39 kernel: xics_enable_irq: irq=36868: ibm_int_on returned -3 There are *fw* files for mthca0, but I don't see for ehca in /sys/class, so I am not sure (since these are pci-e cards, nothing shows up in lspci -v). What should I look for? Thanks, - KK
[ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000
[Removing Jeff as requested from thread :) ] Hi Dave, David Miller [EMAIL PROTECTED] wrote on 09/17/2007 07:55:02 AM: From: jamal [EMAIL PROTECTED] Date: Sun, 16 Sep 2007 22:14:21 -0400 I still think this work - despite my vested interest - needs more scrutiny from a performance perspective. Absolutely. There are tertiary issues I'm personally interested in, for example how well this stuff works when we enable software GSO on a non-TSO capable card. In such a case the GSO segment should be split right before we hit the driver and then all the sub-segments of the original GSO frame batched in one shot down to the device driver. In this way you'll get a large chunk of the benefit of TSO without explicit hardware support for the feature. There are several cards (some even 10GB) that will benefit immensely from this. I have tried this on ehca which does not support TSO. I added GSO flag at the ipoib layer (and that resulted in a panic/fix that is mentioned in this patchset). I will re-run tests for this and submit results. Thanks, - KK
[ofa-general] [RESEND] Implement batching skb API and support in IPoIB
Hi Dave, I am re-sending in case you didn't get this earlier. Also sending REV5 of the patch. I will send the patch for e1000e on Monday or Tuesday after making the changes and testing over the weekend. thanks, - KK __ Hi Dave, David Miller [EMAIL PROTECTED] wrote on 08/29/2007 10:21:50 AM: From: Krishna Kumar2 [EMAIL PROTECTED] Date: Wed, 29 Aug 2007 08:53:30 +0530 I am scp'ng from 192.168.1.1 to 192.168.1.2 and captured at the send side. Bad choice of test, this is cpu limited since the scp has to encrypt and MAC hash all the data it sends. Use something like straight ftp or bw_tcp from lmbench. I used bw_tcp from lmbench-3. I transferred 500MB and captured the tcpdump, and analysis at various points gave pipeline sizes: 26064, 27792, 22888, 23168, 23448, 20272, 23168, 4344, 10136, 164792, 35920, 26344, 24336, 24336, 23168, 25784, 23168. There was one huge 164K, otherwise most were in smaller ranges like 20-30K. I ran the following test script:

SERVER=192.168.1.2
BYTES=100m
BUFFERSIZES="4096 16384 65536 131072 262144"
PROCS="1 8"
ITERATIONS=5

for m in $BUFFERSIZES
do
        for procs in $PROCS
        do
                echo "TEST: Size:$m Procs:$procs"
                bw_tcp -N $ITERATIONS -m $m -M $BYTES -P $procs $SERVER
        done
done

Result is:

Test without batching:
#    Size     Procs   BW (MB/s)
1    4096     1       117.39
2    16384    1       117.49
3    65536    1       117.55
4    131072   1       117.55
5    262144   1       117.58
6    4096     8       117.18
7    16384    8       117.47
8    65536    8       117.54
9    131072   8       117.59
10   262144   8       117.55

Test with batching:
#    Size     Procs   BW (MB/s)
1    4096     1       117.39
2    16384    1       117.48
3    65536    1       117.55
4    131072   1       117.58
5    262144   1       117.58
6    4096     8       117.19
7    16384    8       117.46
8    65536    8       117.53
9    131072   8       117.55
10   262144   8       117.60

So it doesn't seem to harm e1000. Can someone give a link to the E1000E driver? I couldn't find it after downloading Jeff's netdev-2.6 tree. Thanks, - KK
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
Hi Dave, I am scp'ng from 192.168.1.1 to 192.168.1.2 and captured at the send side:

192.168.1.1.37201 > 192.168.1.2.ssh: P 837178092:837178596(504) ack 1976304527 win 79 <nop,nop,timestamp 36791208 123919397>
192.168.1.1.37201 > 192.168.1.2.ssh: . 837178596:837181492(2896) ack 1976304527 win 79 <nop,nop,timestamp 36791208 123919397>
192.168.1.1.37201 > 192.168.1.2.ssh: . 837181492:837184388(2896) ack 1976304527 win 79 <nop,nop,timestamp 36791208 123919397>
192.168.1.1.37201 > 192.168.1.2.ssh: . 837184388:837188732(4344) ack 1976304527 win 79 <nop,nop,timestamp 36791208 123919397>
192.168.1.1.37201 > 192.168.1.2.ssh: . 837188732:837193076(4344) ack 1976304527 win 79 <nop,nop,timestamp 36791208 123919397>
192.168.1.1.37201 > 192.168.1.2.ssh: . 837193076:837194524(1448) ack 1976304527 win 79 <nop,nop,timestamp 36791208 123919397>
192.168.1.2.ssh > 192.168.1.1.37201: . ack 837165060 win 3338 <nop,nop,timestamp 123919397 36791208>

Data in pipeline: 837194524 - 837165060 = 29464. In most cases, I am getting 7K, 8K, 13K, and rarely close to 16K. I ran iperf with 4K, 16K and 32K (as I could do multiple threads instead of a single process). Results are (for E1000 82547GI chipset, BW in KB/s):

Test                    Org BW    New BW    %
Size:4096  Procs:1      114612    114644    .02
Size:16394 Procs:1      114634    114644    0
Size:32768 Procs:1      114645    114643    0

And for multiple threads:

Test                    Org BW    New BW    %
Size:4096  Procs:8      114632    114637    0
Size:4096  Procs:16     114639    114637    0
Size:4096  Procs:64     114893    114800    -.08
Size:16394 Procs:8      114641    114642    0
Size:16394 Procs:16     114642    114643    0
Size:16394 Procs:64     114911    114781    -.11
Size:32768 Procs:8      114638    114639    0
Size:32768 Procs:16     114642    114645    0
Size:32768 Procs:64     114932    114777    -.13

I will run netperf and report CPU utilization too. Thanks, - KK David Miller [EMAIL PROTECTED] wrote on 08/21/2007 12:48:24 PM: From: Krishna Kumar2 [EMAIL PROTECTED] Date: Fri, 17 Aug 2007 11:36:03 +0530 I ran 3 iterations of 45 sec tests (total 1 hour 16 min, but I will run a longer one tonight). 
The results are (results in KB/s, and %): I ran an 8.5 hour run with no batching + another 8.5 hour run with batching (Buffer sizes: 32 128 512 4096 16384, Threads: 1 8 32, Each test run time: 3 minutes, Iterations to average: 5). TCP seems to get a small improvement. Using a 16K buffer size really isn't going to keep the pipe full enough for TSO. And realistically applications queue much more data at a time. Also, smaller buffer sizes can have negative effects for the dynamic receive and send buffer growth algorithm the kernel uses; it might consider the connection application limited for too long. I would really prefer to see numbers that use buffer sizes more in line with the amount of data that is typically in flight on a 1G connection on a local network. Do a tcpdump during the height of the transfer to see about what this value is. When an ACK comes in, compare the sequence number it's ACK'ing with the sequence number of the most recently sent frame. The difference is approximately the pipe size at maximum congestion window, assuming a loss-free local network.
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
[EMAIL PROTECTED] wrote on 08/29/2007 10:21:50 AM: From: Krishna Kumar2 [EMAIL PROTECTED] Date: Wed, 29 Aug 2007 08:53:30 +0530 I am scp'ng from 192.168.1.1 to 192.168.1.2 and captured at the send side. Bad choice of test, this is cpu limited since the scp has to encrypt and MAC hash all the data it sends. Use something like straight ftp or bw_tcp from lmbench. OK. Using a different tool seems strange to me, why not just adjust the buffer size with command line options in the benchmark you were using in the first place? The reason was to run parallel copies, not for buffer limitations. Let me use the same tool for the benchmark. Thanks, - KK
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
Hi Dave, David Miller [EMAIL PROTECTED] wrote on 08/22/2007 09:52:29 AM: From: Krishna Kumar2 [EMAIL PROTECTED] Date: Wed, 22 Aug 2007 09:41:52 +0530 snip Because TSO does batching already, so it's a very good tit for tat comparison of the new batching scheme vs. an existing one. I am planning to do more testing on your suggestion over the weekend, but I had a comment. Are you saying that TSO and batching should be mutually exclusive so hardware that doesn't support TSO (like IB) only would benefit? But even if they can co-exist, aren't cases like sending multiple small skbs better handled with batching? I'm not making any suggestions, so don't read that into anything I've said :-) I think the jury is still out, but seeing TSO perform even slightly worse with the batching changes in place would be very worrisome. This applies to both throughput and cpu utilization. Does turning off batching solve that problem? What I mean by that is: batching can be disabled if a TSO device is worse for some cases. In fact, something that I changed in my latest code is to not enable batching in register_netdevice (in Rev4 which I am sending in a few mins); rather, the user has to explicitly turn batching 'on'. Wondering if that is what you are concerned about. In any case, I will test your case on Monday (I am on vacation for the next couple of days). Thanks, - KK
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
David Miller [EMAIL PROTECTED] wrote on 08/22/2007 02:44:40 PM: From: Krishna Kumar2 [EMAIL PROTECTED] Date: Wed, 22 Aug 2007 12:33:04 +0530 Does turning off batching solve that problem? What I mean by that is: batching can be disabled if a TSO device is worse for some cases. This new batching stuff isn't going to be enabled or disabled on a per-device basis just to get parity with how things are now. It should be enabled by default, and give at least as good performance as what can be obtained right now. That was how it was in earlier revisions. In revision 4 I coded it so that it is enabled only if explicitly set by the user. I can revert that change. Otherwise it's a clear regression. Definitely. For drivers that support it, it should not reduce performance. Thanks, - KK
[ofa-general] Re: [PATCH 10/10 Rev4] [E1000] Implement batching
Hi Auke, Kok, Auke [EMAIL PROTECTED] wrote on 08/22/2007 08:09:31 PM: Krishna, while I appreciate the patch I would have preferred a patch to e1000e. Not only does the e1000e driver remove a lot of the workarounds for old silicon, it is also a good way for us to move the current e1000 driver into a bit more stable maintenance mode. Do you think you can write this patch for e1000e instead? code-wise a lot of things are still the same, so your patch should be relatively easy to generate. e1000e currently lives in a branch from jeff garzik's netdev-2.6 tree Definitely, I will pick it up and generate a patch. Thanks, - KK
[ofa-general] Re: [PATCH 1/10 Rev4] [Doc] HOWTO Documentation for batching
Hi Randy, Thanks for your suggestions. Will clean up those changes. - KK Randy Dunlap [EMAIL PROTECTED] wrote on 08/22/2007 09:20:13 PM: On Wed, 22 Aug 2007 13:58:58 +0530 Krishna Kumar wrote: Add Documentation describing batching skb xmit capability. Signed-off-by: Krishna Kumar [EMAIL PROTECTED] --- batching_skb_xmit.txt | 78 ++ 1 files changed, 78 insertions(+) diff -ruNp org/Documentation/networking/batching_skb_xmit.txt new/Documentation/networking/batching_skb_xmit.txt --- org/Documentation/networking/batching_skb_xmit.txt 1970-01-01 05:30:00.0 +0530 +++ new/Documentation/networking/batching_skb_xmit.txt 2007-08-22 10:21:19.0 +0530 @@ -0,0 +1,78 @@ + HOWTO for batching skb xmit support + --- + +Section 1: What is batching skb xmit +Section 2: How batching xmit works vs the regular xmit +Section 3: How drivers can support batching +Section 4: How users can work with batching + + +Introduction: Kernel support for batching skb +-- + +A new capability to support xmit of multiple skbs is provided in the netdevice +layer. Drivers which enable this capability should be able to process multiple +skbs in a single call to their xmit handler. + + +Section 1: What is batching skb xmit +- + + This capability is optionally enabled by a driver by setting the + NETIF_F_BATCH_SKBS bit in dev->features. The pre-requisite for a prerequisite + driver to use this capability is that it should have a reasonably I would say reasonably-sized. + sized hardware queue that can process multiple skbs. + + +Section 2: How batching xmit works vs the regular xmit +--- + + The network stack gets called from upper layer protocols with a single + skb to transmit. This skb is first enqueue'd and an attempt is made to enqueued + transmit it immediately (via qdisc_run). However, events like tx lock + contention, tx queue stopped, etc, can result in the skb not getting etc., + sent out and it remains in the queue. 
When the next xmit is called or + when the queue is re-enabled, qdisc_run could potentially find + multiple packets in the queue, and iteratively send them all out + one-by-one. + + Batching skb xmit is a mechanism to exploit this situation where all + skbs can be passed in one shot to the device. This reduces driver + processing, locking at the driver (or in stack for ~LLTX drivers) + gets amortized over multiple skbs, and in case of specific drivers + where every xmit results in a completion processing (like IPoIB) - + optimizations can be made in the driver to request a completion for + only the last skb that was sent which results in saving interrupts + for every (but the last) skb that was sent in the same batch. + + Batching can result in significant performance gains for systems that + have multiple data stream paths over the same network interface card. + + +Section 3: How drivers can support batching +- + + Batching requires the driver to set the NETIF_F_BATCH_SKBS bit in + dev->features. + + The driver's xmit handler should be modified to process multiple skbs + instead of one skb. The driver's xmit handler is called either with a an + skb to transmit or NULL skb, where the latter case should be handled + as a call to xmit multiple skbs. This is done by sending out all skbs + in the dev->skb_blist list (where it was added by the core stack). + + +Section 4: How users can work with batching +- + + Batching can be disabled for a particular device, e.g. on desktop + systems if only one stream of network activity for that device is + taking place, since performance could be slightly affected due to + extra processing that batching adds (unless packets are getting + sent fast resulting in stopped queue's). Batching can be enabled if queues). + more than one stream of network activity per device is being done, + e.g. on servers; or even desktop usage with multiple browser, chat, + file transfer sessions, etc. 
+
+	Per device batching can be enabled/disabled by passing 'on' or 'off'
+	respectively to ethtool.

[Randy: with what other parameter(s), e.g., "ethtool <dev> batching on/off"?]

---
~Randy

*** Remember to use Documentation/SubmitChecklist when testing your code ***

___
general mailing list
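The driver contract described in Section 3 - a non-NULL skb means a normal single-packet send, while a NULL skb means "drain everything on dev->skb_blist in this one call" - can be sketched as a small userspace model. This is only illustrative: `mock_dev`, `mock_skb` and `mock_xmit` are made-up stand-ins, not the kernel API.

```c
#include <assert.h>
#include <stddef.h>

struct mock_skb { struct mock_skb *next; };

struct mock_dev {
    struct mock_skb *blist_head;   /* stands in for dev->skb_blist */
    int hw_queued;                 /* packets handed to "hardware" */
    int xmit_calls;                /* how often the xmit handler ran */
};

/* Queue an skb on the device's batch list (what the core stack does). */
static void blist_add(struct mock_dev *dev, struct mock_skb *skb)
{
    skb->next = dev->blist_head;
    dev->blist_head = skb;
}

/* The driver's xmit handler: a non-NULL skb is sent alone; a NULL skb
 * means "drain everything on the batch list in this single call". */
static int mock_xmit(struct mock_dev *dev, struct mock_skb *skb)
{
    dev->xmit_calls++;
    if (skb) {
        dev->hw_queued++;
        return 0;
    }
    while (dev->blist_head) {
        dev->blist_head = dev->blist_head->next;
        dev->hw_queued++;
    }
    return 0;
}
```

The point of the model is the two counters: a whole batch of queued packets reaches the hardware queue through a single call into the xmit handler, which is where the locking and per-packet-overhead savings come from.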
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
David Miller [EMAIL PROTECTED] wrote on 08/22/2007 12:21:43 AM:

From: jamal [EMAIL PROTECTED] Date: Tue, 21 Aug 2007 08:30:22 -0400

On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:

Using 16K buffer size really isn't going to keep the pipe full enough for TSO.

Why the comparison with TSO (or GSO for that matter)?

Because TSO does batching already, so it's a very good tit-for-tat comparison of the new batching scheme vs. an existing one.

I am planning to do more testing on your suggestion over the weekend, but I had a comment. Are you saying that TSO and batching should be mutually exclusive, so that only hardware that doesn't support TSO (like IB) would benefit? But even if they can co-exist, aren't cases like sending multiple small skbs better handled with batching? Thanks, - KK

___
general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
Hi Dave,

I ran 3 iterations of 45 sec tests (total 1 hour 16 min, but I will run a longer one tonight). The results are (results in KB/s, and %): I ran an 8.5-hour run with no batching + another 8.5-hour run with batching (Buffer sizes: 32 128 512 4096 16384, Threads: 1 8 32, Each test run time: 3 minutes, Iterations to average: 5). TCP seems to get a small improvement.

Thanks, - KK

--- TCP ---
Size:32    Procs:1    3415     2321/3321  -2.75
Size:32    Procs:1    3415     3321     -2.75
Size:128   Procs:1    13094    13388    2.24
Size:512   Procs:1    49037    50683    3.35
Size:4096  Procs:1    114646   114619   -.02
Size:16384 Procs:1    114626   114644   .01
Size:32    Procs:8    22675    22633    -.18
Size:128   Procs:8    77994    77297    -.89
Size:512   Procs:8    114716   114711   0
Size:4096  Procs:8    114637   114636   0
Size:16384 Procs:8    95814    114638   19.64
Size:32    Procs:32   23240    23349    .46
Size:128   Procs:32   82284    82247    -.04
Size:512   Procs:32   114885   114769   -.10
Size:4096  Procs:32   95735    114634   19.74
Size:16384 Procs:32   114736   114641   -.08
Average:              1151534  1190210  3.36%

--- No Delay: ---
Size:32    Procs:1    3002     2873     -4.29
Size:128   Procs:1    11853    11801    -.43
Size:512   Procs:1    45565    45837    .59
Size:4096  Procs:1    114511   114485   -.02
Size:16384 Procs:1    114521   114555   .02
Size:32    Procs:8    8026     8029     .03
Size:128   Procs:8    31589    31573    -.05
Size:512   Procs:8    111506   105766   -5.14
Size:4096  Procs:8    114455   114454   0
Size:16384 Procs:8    95833    114491   19.46
Size:32    Procs:32   8005     8027     .27
Size:128   Procs:32   31475    31505    .09
Size:512   Procs:32   114558   113687   -.76
Size:4096  Procs:32   114784   114447   -.29
Size:16384 Procs:32   114719   114496   -.19
Average:              1046026  1034402  -1.11%
---

___
general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
Hi Dave,

David Miller [EMAIL PROTECTED] wrote on 08/08/2007 04:19:00 PM:

From: Krishna Kumar [EMAIL PROTECTED] Date: Wed, 08 Aug 2007 15:01:14 +0530

RESULTS: The performance improvement for TCP No Delay is in the range of -8% to 320% (with -8% being the sole negative), with many individual tests giving 50% or more improvement (I think it is to do with the hw slots getting full quicker, resulting in more batching when the queue gets woken). The results for TCP are in the range of -11% to 93%, with most of the tests (8/12) giving improvements.

Not because I think it obviates your work, but rather because I'm curious, could you test a TSO-in-hardware driver converted to batching and see how TSO alone compares to batching for a pure TCP workload? I personally don't think it will help for that case at all as TSO likely does a better job of coalescing the work _and_ reducing bus traffic as well as work in the TCP stack.

I used E1000 (guess the choice is OK as e1000_tso returns TRUE; my hw is 82547GI). You are right, it doesn't help the TSO case at all (in fact it degrades). Two things to note though:

- E1000 may not be suitable for adding batching (which is no longer a new API, as I have changed it already).
- Small skbs, where TSO doesn't come into the picture, still seem to improve. A couple of cases for large skbs did result in some improvement (like 4K, TCP No Delay, 32 procs).

[Total segment retransmissions for the original code test run: 2220; for the new code test run: 1620. So the retransmission problem that I was getting seems to be an IPoIB bug, though I did have to fix one bug in my networking component where I was calling qdisc_run(NULL) for the regular xmit path, and change to always use batching.
The problem is that skb1 through skb10 may be present in the queue after each of them failed to be sent out; then net_tx_action fires, which batches all of these into the blist and tries to send them out again, which also fails (e.g. tx lock fail or queue full); then the next single-skb xmit will send the latest skb, ignoring the 10 skbs that are already waiting in the batching list. These 10 skbs are sent out only the next time net_tx_action is called, so out-of-order skbs result.

This fix reduced retransmissions from 180,000 to 55,000 or so. When I changed the IPoIB driver to use iterative sends of each skb instead of creating multiple Work Requests, that number went down to 15].

I ran 3 iterations of 45 sec tests (total 1 hour 16 min, but I will run a longer one tonight). The results are (results in KB/s, and %):

Test Case            Org BW   New BW   % Change

TCP
Size:32   Procs:1    1848     3918     112.01
Size:32   Procs:8    21888    21555    -1.52
Size:32   Procs:32   19317    22433    16.13
Size:256  Procs:1    15584    25991    66.78
Size:256  Procs:8    110937   74565    -32.78
Size:256  Procs:32   105767   98967    -6.42
Size:4096 Procs:1    81910    96073    17.29
Size:4096 Procs:8    113302   94040    -17.00
Size:4096 Procs:32   109664   105522   -3.77

TCP No Delay:
--
Size:32   Procs:1    2688     3177     18.19
Size:32   Procs:8    6568     10588    61.20
Size:32   Procs:32   6573     7838     19.24
Size:256  Procs:1    7869     12724    61.69
Size:256  Procs:8    65652    45652    -30.46
Size:256  Procs:32   95114    112279   18.04
Size:4096 Procs:1    95302    84664    -11.16
Size:4096 Procs:8    19       89111    -19.80
Size:4096 Procs:32   109249   113919   4.27

I will submit Rev4 with suggested changes (including a single merged API) on Thursday after some more testing. Thanks, - KK

___
general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
Forgot to mention one thing:

This fix reduced retransmissions from 180,000 to 55,000 or so. When I changed the IPoIB driver to use iterative sends of each skb instead of creating multiple Work Requests, that number went down to 15.

This also reduced TCP No Delay performance from huge percentages like 200-400%, and it is now almost the same as the original code. So fixing this problem in IPoIB (driver?) will enable use of multiple Work Requests / Work Completions, rather than limiting batching to a single WR/WC.

thanks, - KK

- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html

___
general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
David Miller [EMAIL PROTECTED] wrote on 08/09/2007 09:57:27 AM:

Patrick had suggested calling dev_hard_start_xmit() instead of conditionally calling the new API, and removing the new API entirely. The driver determines whether batching is required or not depending on (skb==NULL) or not. Would that approach be fine with this single interface goal?

It is a valid possibility. Note that this is similar to how we handle TSO: the driver sets the feature bit and in its ->hard_start_xmit() it checks the SKB for the given offload property.

Great, I will try to get rid of the two paths entirely, and see how to re-arrange the code cleanly. thanks, - KK

___
general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Submitted Rev3 of Batching API
Hi all,

I just submitted Rev3 of the batching skb patch with support for IPoIB. I am still getting a huge number of retransmissions: 1,255,207 vs 67, or 18,750 retransmissions in the new code for every single retransmission with the original code. But even with so much extra processing/skbs, I am getting around 20% overall BW improvement. I would really appreciate it if some experts could review the code on the list, and also suggest something regarding the retransmissions. thanks, - KK

___
general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: [PATCH 2/9 Rev3] [core] Add skb_blist hard_start_xmit_batch
Hi Evgeniy,

Evgeniy Polyakov [EMAIL PROTECTED] wrote on 08/08/2007 05:31:43 PM:

+int dev_change_tx_batch_skb(struct net_device *dev, unsigned long new_batch_skb)
+{
+	int ret = 0;
+	struct sk_buff_head *blist;
+
+	if (!dev->hard_start_xmit_batch) {
+		/* Driver doesn't support batching skb API */
+		ret = -ENOTSUPP;
+		goto out;
+	}
+
+	/* Handle invalid argument */
+	if (new_batch_skb < 0) {
+		ret = -EINVAL;
+		goto out;
+	}

It is unsigned, how can it be less than zero?

Yuck, originally I had it as int and changed to ulong and forgot to remove this check.

And actually you use it just like a binary flag (casted to/from u32 in the code, btw), so why not use ethtool_value directly here?

I still need to check if the value is changing, so the one check is needed. Later I am using it as a value directly.

+	/* Check if new value is same as the current */
+	if (!!dev->skb_blist == !!new_batch_skb)
+		goto out;
+
+	if (new_batch_skb &&
+	    (blist = kmalloc(sizeof *blist, GFP_KERNEL)) == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	spin_lock(&dev->queue_lock);
+	if (new_batch_skb) {
+		skb_queue_head_init(blist);
+		dev->skb_blist = blist;
+	} else
+		free_batching(dev);
+	spin_unlock(&dev->queue_lock);

This needs the bh lock too, since blist is accessed by qdisc_restart.

Yes, had it in the code, put it in the list of changes, but missed it for some reason :(

+int dev_add_skb_to_blist(struct sk_buff *skb, struct net_device *dev)
+{
+	if (!list_empty(&ptype_all))
+		dev_queue_xmit_nit(skb, dev);
+
+	if (netif_needs_gso(dev, skb)) {
+		if (unlikely(dev_gso_segment(skb))) {
+			kfree(skb);
+			return 0;
+		}
+
+		if (skb->next) {
+			int count = 0;
+
+			do {
+				struct sk_buff *nskb = skb->next;
+
+				skb->next = nskb->next;
+				__skb_queue_tail(dev->skb_blist, nskb);
+				count++;
+			} while (skb->next);

Is it possible to move the list without iterating over each entry?

Though I cannot see something obvious to do that, let me see if something is possible, as it will make a good difference.
thanks for your suggestions, - KK ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
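Evgeniy's question above - moving the whole segment list in one step instead of queueing each entry - is answerable in principle if the chain tracks its tail and length. A minimal userspace sketch of such an O(1) splice (the `seg`/`seg_list` types are illustrative stand-ins, not the kernel's sk_buff machinery):

```c
#include <assert.h>
#include <stddef.h>

struct seg { struct seg *next; };

struct seg_list {
    struct seg *head, *tail;
    int len;
};

/* Splice an entire chain (head..tail, n entries) onto dst in one step,
 * instead of dequeueing and re-queueing each entry individually. */
static void seg_splice_tail(struct seg_list *dst,
                            struct seg *head, struct seg *tail, int n)
{
    if (!head)
        return;
    tail->next = NULL;
    if (dst->tail)
        dst->tail->next = head;
    else
        dst->head = head;
    dst->tail = tail;
    dst->len += n;
}
```

The design point is that the cost no longer scales with the number of segments; the trade-off is having to carry a tail pointer and count alongside the chain.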
[ofa-general] Re: [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching
Evgeniy Polyakov [EMAIL PROTECTED] wrote on 08/08/2007 05:44:02 PM:

On Wed, Aug 08, 2007 at 03:01:45PM +0530, Krishna Kumar ([EMAIL PROTECTED]) wrote:

+static inline int get_skb(struct net_device *dev, struct Qdisc *q,
+			  struct sk_buff_head *blist, struct sk_buff **skbp)
+{
+	if (likely(!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1))) {
+		return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
+	} else {
+		int max = dev->tx_queue_len - skb_queue_len(blist);
+		struct sk_buff *skb;
+
+		while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
+			max -= dev_add_skb_to_blist(skb, dev);
+
+		*skbp = NULL;
+		return 1;	/* we have at least one skb in blist */
+	}
+}

Same here - is it possible to get a list in one go instead of pulling one-by-one, since it forces quite a few additional unneeded lock gets/releases? What about dev_dequeue_number_skb(dev, q, num), which will grab the lock and move a list of skbs from one queue to the provided head?

OK, I will try this out.

@@ -158,7 +198,10 @@ static inline int qdisc_restart(struct n
 	/* And release queue */
 	spin_unlock(&dev->queue_lock);
-	ret = dev_hard_start_xmit(skb, dev);
+	if (likely(skb))
+		ret = dev_hard_start_xmit(skb, dev);
+	else
+		ret = dev->hard_start_xmit_batch(dev);

Perfectionism says that having an array of two functions and calling one of them via array_func_pointer[!!skb] will be much faster. Just a thought. It is actually much faster than if/else, on x86 at least.

Thinking about this - I will have to store the 2-pointer array in dev itself, wasting some space, and also the fn pointer will have the wrong signature as one takes an extra argument. Will ponder some more :)

thanks, - KK

___
general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
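The dev_dequeue_number_skb() idea Evgeniy floats above can be sketched in plain userspace C: take the "lock" once and move up to `num` entries from the qdisc queue to the batch list, instead of one lock round-trip per packet. The `pkt`/`pkt_queue` types and the lock counter are illustrative stand-ins, not kernel types.

```c
#include <assert.h>
#include <stddef.h>

struct pkt { struct pkt *next; };

struct pkt_queue {
    struct pkt *head, *tail;
    int len;
    int lock_ops;  /* counts lock acquisitions, to show the saving */
};

static void pq_enqueue(struct pkt_queue *q, struct pkt *p)
{
    p->next = NULL;
    if (q->tail)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;
    q->len++;
}

/* Move up to num packets from src to dst under a single "lock". */
static int pq_dequeue_number(struct pkt_queue *src, struct pkt_queue *dst, int num)
{
    int moved = 0;

    src->lock_ops++;   /* one lock round-trip for the whole batch */
    while (moved < num && src->head) {
        struct pkt *p = src->head;

        src->head = p->next;
        if (!src->head)
            src->tail = NULL;
        src->len--;
        pq_enqueue(dst, p);
        moved++;
    }
    return moved;
}
```

Pulling N packets this way costs one lock acquisition instead of N, which is exactly the "additional unneeded lock gets/releases" the review comment targets.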
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
Herbert Xu [EMAIL PROTECTED] wrote on 08/08/2007 07:12:47 PM:

On Wed, Aug 08, 2007 at 03:49:00AM -0700, David Miller wrote:

Not because I think it obviates your work, but rather because I'm curious, could you test a TSO-in-hardware driver converted to batching and see how TSO alone compares to batching for a pure TCP workload?

You could even lower the bar by disabling TSO and enabling software GSO.

I will try with E1000 (though I didn't see improvement when I tested a long time back). The difference I expect is that TSO would help with large packets and not necessarily small/medium packets, and definitely not in the case of multiple different skbs (as opposed to a single large skb) getting queued. I think these are two different workloads.

I personally don't think it will help for that case at all as TSO likely does a better job of coalescing the work _and_ reducing bus traffic as well as work in the TCP stack.

I agree. I suspect the bulk of the effort is in getting these skbs created and processed by the stack, so that by the time they're exiting the qdisc there's not much to be saved anymore.

However, I am getting a large improvement for IPoIB specifically for this same case. The reason - batching will help only when the queue gets full and stopped (and to a lesser extent if the tx lock was not got, which results in a lesser amount of batching that can be done).

thanks, - KK

___
general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching
Hi Patrick,

Patrick McHardy [EMAIL PROTECTED] wrote on 08/08/2007 07:35:17 PM:

Krishna Kumar wrote:

[snip]

+static inline int get_skb(struct net_device *dev, struct Qdisc *q,
+			  struct sk_buff_head *blist, struct sk_buff **skbp)
+{
+	if (likely(!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1))) {
+		return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
+	} else {
+		int max = dev->tx_queue_len - skb_queue_len(blist);
+		struct sk_buff *skb;
+
+		while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
+			max -= dev_add_skb_to_blist(skb, dev);
+
+		*skbp = NULL;
+		return 1;	/* we have at least one skb in blist */
+	}
+}

I think you missed my previous reply to this in the flood of responses (or I missed your reply), so I'm copying it below:

Sorry, but I didn't get your post on this point earlier (thanks for posting again).

The entire idea of a single queue after the qdiscs that is refilled independently of transmission times etc. makes me worry a bit. By changing the timing you're effectively changing the qdiscs' behaviour, at least in some cases. SFQ is a good example, but I believe it affects most work-conserving qdiscs. Think of this situation:

100 packets of flow 1 arrive
50 packets of flow 1 are sent
100 packets for flow 2 arrive
remaining packets are sent

On the wire you'll first see 50 packets of flow 1, then 100 packets alternating between flow 1 and flow 2, then 50 packets of flow 2. With your additional queue, all packets of flow 1 are pulled out of the qdisc immediately and put in the fifo. When the 100 packets of the second flow arrive they will also get pulled out immediately and are put in the fifo behind the remaining 50 packets of flow 1.
So what you get on the wire is: 100 packets of flow 1, then 100 packets of flow 2.

In the normal case (qdisc run from xmit and not from tx_action), the code executing is the same as the regular code, without any difference in wire behavior, due to the check:

	if (likely(!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1))) {
		return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);

(I always pass blist as NULL). With SFQ for the above scenario, my code will first send out 50 F1 skbs iteratively (blist==NULL), get blocked so that 50 skbs accumulate on F1 and 100 on F2; then, when it is woken up, it will batch, but it will pick up 50 F1 and 50 F2 skbs in alternate order and put them in the queue, and finally pick up the remaining 50 F2 skbs and put those too in the queue. Since I am calling dev_dequeue_skb, I am assured of RR when using SFQ. The only time I would have 100 skbs from F1 in my queue (and hence sent out first) is if there are NO F2 skbs. So SFQ is without any effect.

This is not completely avoidable of course, but you can and should limit the damage by only pulling out as many packets as the driver can take and having the driver stop the queue afterwards.

I feel this cannot happen. Please correct me if I am wrong. Thanks, - KK

___
general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
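Both sides of this exchange hinge on when packets leave the per-flow queues. A toy userspace model of a fair (per-packet alternating) drain makes the wire order in Patrick's scenario concrete; `fair_drain` is purely illustrative and has nothing to do with the real SFQ implementation.

```c
#include <assert.h>

/* Drain two flows fairly, one packet at a time, alternating whenever both
 * have packets queued. out[] receives the flow id (1 or 2) of each packet
 * in the order it would hit the wire; *n receives the total count. */
static void fair_drain(int f1, int f2, int out[], int *n)
{
    int turn = 1;

    *n = 0;
    while (f1 || f2) {
        if (turn == 1 && f1)      { out[(*n)++] = 1; f1--; }
        else if (turn == 2 && f2) { out[(*n)++] = 2; f2--; }
        else if (f1)              { out[(*n)++] = 1; f1--; }
        else                      { out[(*n)++] = 2; f2--; }
        turn = (turn == 1) ? 2 : 1;
    }
}
```

For the disputed phase of the scenario (50 flow-1 packets and 100 flow-2 packets still queued), the fair drain interleaves the two flows and only trails off with flow-2 packets. An eager pull into an intermediate fifo would instead emit all remaining flow-1 packets before any flow-2 packet, which is exactly the reordering Patrick describes.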
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
Hi Dave,

David Miller [EMAIL PROTECTED] wrote on 08/09/2007 03:31:37 AM:

What do you generally think of the patch/implementation? :)

We have two driver implementation paths on receive and now we'll have two on send, and that's not a good trend.

Correct.

In an ideal world all the drivers would be NAPI and netif_rx() would only be used by tunneling drivers and similar in the protocol layers. And likewise all sends would go through ->hard_start_xmit(). If you can come up with a long term strategy that gets rid of the special transmit method, that'd be great. We should make Linux network drivers easy to write, not more difficult by constantly adding more interfaces than we consolidate.

I think that is a good top level view, and I agree with that. Patrick had suggested calling dev_hard_start_xmit() instead of conditionally calling the new API, and removing the new API entirely. The driver determines whether batching is required or not depending on (skb==NULL) or not. Would that approach be fine with this single interface goal?

Thanks, - KK

___
general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [RFH] IPoIB retransmission when sending multiple WR's to device
(Request For Help)

Hi,

On the same topic that I wrote about earlier: I put debugs in my code to store all skbs in bufferA when enqueuing multiple skbs, and store all skbs in bufferB just before doing the post. During the post, I compare the two buffers to make sure that I am not posting in the wrong order, and that never happens. But I am getting a huge amount of retransmissions anyway, and I don't understand why this happens. It is either the IPoIB driver changes I made, or the driver that is processing multiple WR's, or the firmware that is sending out of order.

I don't see high retransmission when I run over E1000 (I had ported the E1000 driver too, just to make sure that the core networking code is not putting skbs out of order, and the results confirm it is correct). The retransmission number is around 200 vs 6000-1 / sec for the Original/New IPoIB drivers. The retransmissions affect results when I have a high number of threads (like 64); with a small # of threads, even with the retransmissions, I am getting very good improvement in throughput. But with 64 threads, I am getting a fall from 270 MBps to 250 MBps.

Since I am really stuck for the last couple of weeks on this, can someone help by taking a quick look at my code (attached as just the functions that are changed), and/or suggest some way to debug/fix this? Thanks, - KK

(See attached file: CODE)

CODE Description: Binary data

___
general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: [RFH] IPoIB retransmission when sending multiple WR's to device
Hi Roland,

Roland Dreier [EMAIL PROTECTED] wrote on 08/02/2007 09:59:23 PM:

On the same topic that I wrote about earlier, I put debugs in my code to store all skbs in bufferA when enqueuing multiple skbs, and store all skbs in bufferB just before doing the post. During the post, I compare the two buffers to make sure that I am not posting in the wrong order, and that never happens. But I am getting a huge amount of retransmissions anyway,

Why do you think the retransmissions are related to things being sent out of order? Is it possible you're just sending much faster and overrunning the receiver's queue of posted receives?

I cannot be sure of that. But in the regular code too, batching is done in qdisc_run() in a different sense - it sends out packets *iteratively*. In that case, I see only 225 retransmissions for the entire run of all tests, while in my code I see 100,000 or more (I think I gave a wrong number earlier; this is the right one - 200 vs 100,000). Is there any way to avoid the situation you are talking about? I am already setting recv_queue_size=4096 when loading ipoib (and so for mthca too).

Thanks, - KK

___
general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: [RFH] IPoIB retransmission when sending multiple WR's to device
Hi Roland,

I did one more test to check the out-of-order theory. I changed my new API to be:

/* Original code, unmodified */
ipoib_start_xmit()
{
	original code
}

/* Added new xmit which is identical to original code but doesn't get the lock */
ipoib_start_xmit_nolock()
{
	original code but without getting lock
}

/* Batching API */
ipoib_start_xmit_batch()
{
	get_lock()
	while (skbs in queue) {
		ret = ipoib_start_xmit_nolock()
	}
	unlock()
}

This in effect is fast sends of multiple skbs while holding the lock once for all the skbs, without going back to the IP stack. The only difference is that this creates one WR per skb instead of multiple WR's. I got identical retransmissions (around 240 for the full run) compared to the original code (225), while the multiple-WR code had 100,000 retransmissions. I agree that sending multiple WR's is still faster than this, but this is the second best I could do, and still there is no increase in retransmissions.

The tx_queue_len on ib0 is 8192; recv/send sizes are both 4K. The receiver shows no errors/drops in ifconfig or netstat -s. Is there anything that can be concluded/suspected, or something else I could try? The latest run of the batching API gave me 719,679 retransmissions for a 16 min test run of 16/64 threads (iperf), which comes to 600 retransmissions per sec - more than the retransmissions during the entire run for the regular code!

thanks, - KK

___
general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Processing multiple WR's in MTHCA (other drivers).
Hi,

If I call post_send with multiple WR's (as a list of 'next'-linked WR's) from IPoIB to the driver, is it guaranteed that those WR's will go out on the wire in the same sequence? Will the send sequence be: WR[0], WR[1], ..., WR[n]; or can this not be guaranteed by the driver after it issues the doorbell?

The reason to ask is that when I modify the networking + ipoib code to send multiple skbs, I see a huge number of retransmissions (original code: 250, modified code: 5000 or more), as though skbs were sent out of order, and any amount of staring at the code is not helping. The code to send multiple WR's is (hopefully readable):

/*
 * Normally start_index is 0 to indicate send from WR[0], but if one post_send
 * fails, start_index is set to the index of the first untried WR by
 * ipoib_send(), which calls post_send() again. I have also not seen any
 * failures from the provider and a subsequent retry that causes this
 * retransmission to happen.
 */
static inline int post_send(struct ipoib_dev_priv *priv, u32 qpn,
			    int start_index, int num_skbs,
			    struct ib_send_wr **bad_wr)
{
	int ret;
	struct ib_send_wr *last_wr, *next_wr;

	last_wr = &priv->tx_wr[start_index + num_skbs - 1];

	/* Set Completion Notification for last WR */
	last_wr->send_flags = IB_SEND_SIGNALED;

	/* Terminate the last WR */
	next_wr = last_wr->next;
	last_wr->next = NULL;

	/* Send all the WR's in one doorbell */
	ret = ib_post_send(priv->qp, &priv->tx_wr[start_index], bad_wr);

	/* Restore send_flags & WR chain */
	last_wr->send_flags = 0;
	last_wr->next = next_wr;

	return ret;
}

(I am also getting a completion with multiple WR's in each completion handler, which is good.)

Thanks, - KK

___
general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
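The chain bookkeeping in post_send() above - terminate the chain at the last WR of the batch, signal only that WR, then restore the links after posting - can be modelled in plain userspace C. `fake_wr` and `FAKE_SIGNALED` are made-up stand-ins for ib_send_wr / IB_SEND_SIGNALED, not the real verbs types.

```c
#include <assert.h>
#include <stddef.h>

#define FAKE_SIGNALED 0x1

struct fake_wr {
    struct fake_wr *next;
    int send_flags;
};

/* Link wr[0..n-1] into a chain the way priv->tx_wr would be prepared. */
static void chain_init(struct fake_wr wr[], int n)
{
    for (int i = 0; i < n; i++) {
        wr[i].next = (i + 1 < n) ? &wr[i + 1] : NULL;
        wr[i].send_flags = 0;
    }
}

/* Isolate [start, start+num) and signal the last WR, as post_send() does.
 * Returns the saved continuation so the caller can restore the chain. */
static struct fake_wr *chain_clamp(struct fake_wr wr[], int start, int num)
{
    struct fake_wr *last = &wr[start + num - 1];
    struct fake_wr *saved = last->next;

    last->send_flags |= FAKE_SIGNALED;  /* completion only for the last WR */
    last->next = NULL;                  /* terminate the batch */
    return saved;
}

/* Undo chain_clamp() after the post, restoring flags and links. */
static void chain_restore(struct fake_wr wr[], int start, int num,
                          struct fake_wr *saved)
{
    struct fake_wr *last = &wr[start + num - 1];

    last->send_flags = 0;
    last->next = saved;
}
```

The model captures the two invariants the email relies on: only the final WR of the batch requests a completion (saving per-packet interrupts), and the temporary NULL termination is fully undone so a later post starting at any index sees an intact chain.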
[ofa-general] Re: Processing multiple WR's in MTHCA (other drivers).
Yes, I saw that, and also 10.8.3. Just wondering if that is actually compliant in the hardware; the reason being the huge amount of retransmission that I am getting. thanks, - KK

Michael S. Tsirkin [EMAIL PROTECTED] wrote on 08/01/2007 02:47:45 PM:

Quoting Krishna Kumar2 [EMAIL PROTECTED]: Subject: Processing multiple WR's in MTHCA (other drivers).

If I call post_send with multiple WR's (as a list of 'next'-linked WR's) from IPoIB to the driver, is it guaranteed that those WR's will go out on the wire in the same sequence?

See chapter 9.5, transaction ordering: C9-25: A requester shall transmit request messages in the order that the Work Queue Elements (WQEs) were posted.

-- MST

___
general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: [PATCH 00/10] Implement batching skb API
Jamal,

This is silly. I am not responding to this type of presumptuous and insulting mail.

Regards, - KK

J Hadi Salim [EMAIL PROTECTED] wrote on 07/25/2007 12:58:20 AM:

KK,

On Tue, 2007-24-07 at 09:14 +0530, Krishna Kumar2 wrote:

J Hadi Salim [EMAIL PROTECTED] wrote on 07/23/2007 06:02:01 PM:

Actually you have not sent netperf results with prep and without prep.

My results were based on pktgen (which I explained as testing the driver). I think depending on netperf without further analysis is simplistic. It was like me doing forwarding tests on these patches.

So _which_ non-LLTX driver doesn't do that? ;->

I have no idea since I haven't looked at all drivers. Can you tell which all non-LLTX drivers do that?

I stated this as the sole criteria. The few I have peeked at all do it. I also think the e1000 should be converted to be non-LLTX. The rest of netdev is screaming to kill LLTX.

The tun driver doesn't use it either - but I doubt that makes it bloat.

Adding extra code that is currently not usable (esp. from a submission point) is bloat.

So far I have converted 3 drivers, 1 of them doesn't use it. Two more driver conversions are on the way; they will both use it. How is this bloat again? A few emails back you said if only IPoIB can use batching then that's good enough justification.

You waltz in, have the luxury of looking at my code, presentations, many discussions with me etc ...

luxury? I had implemented the entire thing even before knowing that you are working on something similar! and I had sent the first proposal to netdev. I saw your patch at the end of May (or at least 2 weeks after you said it existed). That patch has very little resemblance to what you just posted, concept-wise or code-wise. I could post it if you would give me permission. *after* which you told me that you have your own code and presentations (which I had never seen earlier - I joined netdev a few months back; earlier I was working on RDMA, Infiniband as you know).
I am gonna assume you didnt know of my work - which i have been making public for about 3 years. Infact i talked about this topic when i visited your office in 2006 on a day you were not present, so it is plausible you didnt hear of it. And it didn't give me any great ideas either, remember I had posted results for E1000 at the time of sending the proposals. In mid-June you sent me a series of patches which included anything from changing variable names to combining qdisc_restart and about everything i referred to as being cosmetic differences in your posted patches. I took two of those and incorporated them in. One was an XXX in my code already to allocate the dev-blist (Commit: bb4464c5f67e2a69ffb233fcf07aede8657e4f63). The other one was a mechanical removal of the blist being passed (Commit: 0e9959e5ee6f6d46747c97ca8edc91b3eefa0757). Some of the others i asked you to defer. For example, the reason i gave you for not merging any qdisc_restart_combine changes is because i was waiting for Dave to swallow the qdisc_restart changes i made; otherwise maintainance becomes extremely painful for me. Sridhar actually provided a lot more valuable comments and fixes but has not planted a flag on behalf of the queen of spain like you did. However I do give credit in my proposal to you for what ideas that your provided (without actual code), and the same I did for other people who did the same, like Dave, Sridhar. BTW, you too had discussions with me, and I sent some patches to improve your code too, I incorporated two of your patches and asked for deferal of others. These patches have now shown up in what you claim as the difference. I just call them cosmetic difference not to downplay the importance of having an ethtool interface but because they do not make batching perform any better. The real differences are those two items. I am suprised you havent cannibalized those changes as well. 
I thought you renamed them to something else; according to your posting: This patch will work with drivers updated by Jamal, Matt Michael Chan with minor modifications - rename xmit_win to xmit_slots rename batch handler. Or maybe thats a future plan you have in mind? so it looks like a two way street to me (and that is how open source works and should). Open source is a lot more transparent than that. You posted a question, which was part of your research. I responded and told you i have patches; you asked me for them and i promptly ported them from pre-2.6.18 to the latest kernel at the time. The nature of this batching work is one of performance. So numbers are important. If you had some strong disagreements on something in the architecture, then it would be of great value to explain it in a technical detail - and more importantly to provide some numbers to say why it is a bad idea. You get numbers by running some tests. You did none of the above. Your
[ofa-general] Re: [PATCH 02/12 -Rev2] Changes to netdevice.h
Hi Patrick,

Krishna Kumar2/India/IBM wrote on 07/23/2007 08:27:53 AM:

> Hi Patrick,
>
> Patrick McHardy [EMAIL PROTECTED] wrote on 07/22/2007 10:36:51 PM:
> > Krishna Kumar wrote:
> > > @@ -472,6 +474,9 @@ struct net_device
> > >  	void *priv;	/* pointer to private data */
> > >  	int (*hard_start_xmit) (struct sk_buff *skb, struct net_device *dev);
> > > +	int (*hard_start_xmit_batch) (struct net_device *dev);
> >
> > Is this function really needed? Can't you just call hard_start_xmit
> > with a NULL skb and have the driver use dev->blist?
>
> Probably not. I will see how to do it this way and get back to you.

I think this is a good idea and makes code everywhere simpler. I will try this change and test to make sure it doesn't have any negative impact. Will mostly send out rev3 tomorrow.

Thanks,
- KK

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
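[Editorial note] Patrick's suggestion above folds the batch handler into the normal xmit entry point by treating a NULL skb as "drain the device's batch list". A minimal user-space sketch of that calling convention, with all names (fake_dev, fake_hard_start_xmit) purely illustrative rather than the real kernel API:

```c
#include <stddef.h>

struct sk_buff { int len; };          /* stand-in for the kernel's sk_buff */

struct fake_dev {
	int blist_len;                /* number of skbs queued on dev->blist */
};

/* One entry point for both modes; returns the number of skbs handed to
 * the hardware. */
static int fake_hard_start_xmit(struct sk_buff *skb, struct fake_dev *dev)
{
	if (skb == NULL) {            /* batch mode: flush the batch list */
		int sent = dev->blist_len;
		dev->blist_len = 0;
		return sent;
	}
	return 1;                     /* regular single-skb transmit */
}
```

The appeal of this shape, as discussed in the thread, is that drivers need only one xmit function pointer and the batching capability becomes an argument convention instead of a second API.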
[ofa-general] Re: [PATCH 02/10] Networking include file changes.
Hi Sridhar,

Sridhar Samudrala [EMAIL PROTECTED] wrote on 07/23/2007 11:29:39 AM:

> Krishna Kumar2 wrote:
> > Hi Sridhar,
> >
> > Sridhar Samudrala [EMAIL PROTECTED] wrote on 07/20/2007 10:55:05 PM:
> > > > diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h
> > > > --- org/include/net/pkt_sched.h 2007-07-20 07:49:28.0 +0530
> > > > +++ new/include/net/pkt_sched.h 2007-07-20 08:30:22.0 +0530
> > > > @@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge
> > > >  		struct rtattr *tab);
> > > >  extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
> > > > -extern void __qdisc_run(struct net_device *dev);
> > > > +extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist);
> > >
> > > Why do we need this additional 'blist' argument? Is this different
> > > from dev->skb_blist?
> >
> > It is the same, but I want to call it mostly with NULL and rarely with
> > the batch list pointer (so it is related to your other question). My
> > original code didn't have this and was trying batching in all cases.
> > But in most xmits (probably almost all), there will be only one packet
> > in the queue to send and batching will never happen. When there is
> > lock contention or if the queue is stopped, then the next iteration
> > will find > 1 packets. But I still will try no batching for the lock
> > failure case, as there will probably be 2 packets (one from the
> > previous time and 1 from this time, or 3 if two failures, etc), and
> > try batching only when the queue was stopped from net_tx_action (this
> > was based on Dave Miller's idea).
>
> Is this right to say that the above change is to get this behavior?
>   If qdisc_run() is called from dev_queue_xmit(), don't use batching.
>   If qdisc_run() is called from net_tx_action(), do batching.

Correct.

> Isn't it possible to have multiple skb's in the qdisc queue in the
> first case?

It is possible but rarer (so unnecessary checking most of the time). From net_tx_action you are guaranteed to have multiple skbs, but from xmit you will almost always get one skb (since most sends of 1 skb will go out OK). And also in the xmit path, it is more likely to have a few skbs compared to possibly hundreds in the net_tx_action path.

> If this additional argument is used to indicate if we should do
> batching or not, then passing a flag may be much cleaner than passing
> the blist.

OK, I will add this as another action item to check (along with Patrick's suggestion to use a single API) and will get back.

- KK
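[Editorial note] The dispatch policy agreed above reduces to a one-line predicate. A hedged sketch, assuming the "pass a flag instead of the blist pointer" variant Sridhar suggests (names are illustrative, not kernel symbols):

```c
/* Batching is attempted only when the device opted in AND the call comes
 * from net_tx_action() -- i.e. the queue was stopped and many skbs have
 * likely accumulated.  The hot dev_queue_xmit() path, which almost always
 * carries a single skb, stays on the regular non-batching path. */
enum tx_caller { FROM_DEV_QUEUE_XMIT, FROM_NET_TX_ACTION };

static int use_batching(int dev_batch_enabled, enum tx_caller caller)
{
	return dev_batch_enabled && caller == FROM_NET_TX_ACTION;
}
```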
[ofa-general] Re: [PATCH 00/12 -Rev2] Implement batching skb API
I have started a 10 run test for various buffer sizes and processes, and will post the results on Monday. The 10 iteration run results for Rev2 are (average):

--------------------------------------------------------
Test Case                  Org        New       %Change
--------------------------------------------------------
TCP 1 Process
  Size:32                 2703       3063        13.31
  Size:128               12948      12217        -5.64
  Size:512               48108      55384        15.12
  Size:4096             129089     132586         2.70
  Average:              192848     203250         5.39

TCP 4 Processes
  Size:32                10389      10768         3.64
  Size:128               39694      42265         6.47
  Size:512              159563     156373        -1.99
  Size:4096             268094     256008        -4.50
  Average:              477740     465414        -2.58

TCP No Delay 1 Process
  Size:32                 2606       2950        13.20
  Size:128                8115      11864        46.19
  Size:512               39113      42608         8.93
  Size:4096             103966     105333         1.31
  Average:              153800     162755         5.82

TCP No Delay 4 Processes
  Size:32                 4213       8727       107.14
  Size:128               17579      35143        99.91
  Size:512               70803     123936        75.04
  Size:4096             203541     225259        10.67
  Average:              296136     393065        32.73
--------------------------------------------------------
Overall Average:       1120524    1224484         9.28%

There are three cases that degrade a little (up to -5.6%), but there are 13 cases that improve, and many of those are in the 13% to over 100% range (7 cases).

Thanks,
- KK

Krishna Kumar2/India/[EMAIL PROTECTED] wrote on 07/22/2007 02:34:57 PM:

> This set of patches implements the batching API, and makes the
> following changes resulting from the review of the first set:
>
> Changes:
> 1. Changed skb_blist from a pointer to static, as it saves only 12
>    bytes (i386) but bloats the code.
> 2. Removed the requirement for the driver to set the NETIF_F_BATCH_SKBS
>    feature in register_netdev to enable batching, as it is redundant.
>    Changed this flag to NETIF_F_BATCH_ON; it is set by register_netdev,
>    and other user-changeable calls can modify this bit to
>    enable/disable batching.
> 3. Added ethtool support to enable/disable batching (not tested).
> 4. Added rtnetlink support to enable/disable batching (not tested).
> 5. Removed MIN_QUEUE_LEN_BATCH for batching, as high performance
>    drivers should not have a small queue anyway (adding bloat).
> 6. skbs are purged from dev_deactivate instead of from
>    unregister_netdev to drop all references to the device.
> 7. Removed the changelog in source code in sch_generic.c, and unrelated
>    renames from sch_generic.c (lockless, comments).
> 8. Removed xmit_slots entirely, as it was adding bloat (code and
>    header) and not adding value (it is calculated and set twice, in the
>    internal send routine and in handle work completion, and referenced
>    once in batch xmit; it can instead be calculated once in xmit).
>
> Issues:
> 1. Remove /sysfs support completely?
> 2. Whether rtnetlink support is required, as GSO has only ethtool?
>
> Patches are described as:
> Mail 0/12:  This mail.
> Mail 1/12:  HOWTO documentation.
> Mail 2/12:  Changes to netdevice.h
> Mail 3/12:  dev.c changes.
> Mail 4/12:  Ethtool changes.
> Mail 5/12:  sysfs changes.
> Mail 6/12:  rtnetlink changes.
> Mail 7/12:  Change in qdisc_run & qdisc_restart API, modify callers to
>             use this API.
> Mail 8/12:  IPoIB include file changes.
> Mail 9/12:  IPoIB verbs changes.
> Mail 10/12: IPoIB multicast, CM changes.
> Mail 11/12: IPoIB xmit API addition.
> Mail 12/12: IPoIB xmit internals changes (ipoib_ib.c).
>
> I have started a 10 run test for various buffer sizes and processes,
> and will post the results on Monday. Please review and provide
> feedback/ideas, and consider for inclusion.
>
> Thanks,
> - KK
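[Editorial note] For reference, the %Change column in the tables above is the relative throughput delta of the batching ("New") run over the original ("Org") run:

```c
/* Relative gain in percent of the New throughput over the Org throughput,
 * matching the %Change column, e.g. (3063 - 2703) / 2703 * 100 = 13.31. */
static double pct_change(double org, double new_bw)
{
	return (new_bw - org) / org * 100.0;
}
```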
[ofa-general] Re: [PATCH 03/12 -Rev2] dev.c changes.
Hi Evgeniy,

Evgeniy Polyakov [EMAIL PROTECTED] wrote on 07/23/2007 04:14:28 PM:

> > +/*
> > + * dev_change_tx_batching - Enable or disable batching for a driver
> > + * that supports batching.
> > + */
> > +	/* Check if new value is same as the current */
> > +	if (!!(dev->features & NETIF_F_BATCH_ON) == !!new_batch_skb)
> > +		goto out;
>
> o_O
>
> Scratched head for too long before understood what it means :)

Is there an easy way to do this?

> > +	spin_lock(&dev->queue_lock);
> > +	if (new_batch_skb) {
> > +		dev->features |= NETIF_F_BATCH_ON;
> > +		dev->tx_queue_len = 1;
> > +	} else {
> > +		if (!skb_queue_empty(&dev->skb_blist))
> > +			skb_queue_purge(&dev->skb_blist);
> > +		dev->features &= ~NETIF_F_BATCH_ON;
> > +		dev->tx_queue_len = 1;
> > +	}
> > +	spin_unlock(&dev->queue_lock);
>
> Hmm, should this also stop interrupts?

That is a good question, and I am not sure. I thought it is not required, though adding it doesn't affect the code either. Can someone tell if disabling bh is required and why (couldn't figure out the intention of bh for dev_queue_xmit either; is this to disable preemption)?

Thanks,
- KK
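[Editorial note] The "o_O" expression Evgeniy puzzled over is a standard boolean-normalization idiom: `!!` collapses any non-zero value to 1, so a raw feature-bit mask can be compared against a 0/1 request. A standalone sketch (the bit value 0x10000 is illustrative only; the actual NETIF_F_* value is irrelevant to the idiom):

```c
#define NETIF_F_BATCH_ON 0x10000U  /* illustrative feature bit */

/* Returns 1 when the requested batching state already matches the
 * current state, i.e. there is nothing to change.  Without the '!!',
 * a set feature bit (0x10000) would never compare equal to a request
 * value of 1. */
static int batch_state_unchanged(unsigned int features, int new_batch_skb)
{
	return !!(features & NETIF_F_BATCH_ON) == !!new_batch_skb;
}
```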
[ofa-general] Re: [PATCH 00/10] Implement batching skb API
Hi Jamal,

J Hadi Salim [EMAIL PROTECTED] wrote on 07/23/2007 06:02:01 PM:

> Yes, and these results were sent to you as well a while back. When I
> get the time when I get back, I will look them up on my test machine
> and resend.

Actually you have not sent netperf results with prep and without prep.

> > No. I see value only in non-LLTX drivers which also get the same TX
> > lock in the RX path.
>
> So _which_ non-LLTX driver doesn't do that? ;->

I have no idea since I haven't looked at all drivers. Can you tell which non-LLTX drivers do that? I stated this as the sole criterion.

> tun driver doesn't use it either - but I doubt that makes it bloat

Adding extra code that is currently not usable (esp. from a submission point) is bloat.

> You waltz in, have the luxury of looking at my code, presentations,
> many discussions with me etc ...

Luxury? I had implemented the entire thing even before knowing that you were working on something similar! And I had sent the first proposal to netdev, *after* which you told me that you have your own code and presentations (which I had never seen earlier - I joined netdev a few months back; earlier I was working on RDMA and Infiniband, as you know). And it didn't give me any great ideas either; remember I had posted results for E1000 at the time of sending the proposals. However, I do give credit in my proposal to you for the ideas that you provided (without actual code), and the same I did for other people who did the same, like Dave and Sridhar. BTW, you too had discussions with me, and I sent some patches to improve your code too, so it looks like a two-way street to me (and that is how open source works and should).

> When I ask for differences to code you produced, they now seem to sum
> up to the two below. You don't think there's some honest issue with
> this picture?

Two changes? That's it? I gave a big list of changes between our implementations, but you twist my words to conclude there are just two (by conveniently labelling everything else cosmetic, or potentially useful!). Even my restart routine used a single API from the first day; I would never imagine using multiple APIs. Our code probably doesn't have even one line that looks remotely similar! To clarify: I suggested that you could send patches for the two *missing* items if you can show they add value (and not the rest, as I consider those will not improve the code/logic/algo).

> > (lacking in frankness, candor, or sincerity; falsely or
> > hypocritically ingenuous; insincere)
> > Sorry, no response to personal comments and have a flame-war :)
>
> Give me a better description.

Sorry, no personal comments. In fact I will avoid responding to baits and innuendoes from now on.

Thanks,
- KK
[ofa-general] Re: [PATCH 00/10] Implement batching skb API
Hi Jamal,

J Hadi Salim [EMAIL PROTECTED] wrote on 07/21/2007 06:48:41 PM:

> > - Use a single qdisc interface to avoid code duplication and ease
> >   maintenance (sch_generic.c size reduces by ~9%).
> > - Has a per device configurable parameter to turn on/off batching.
> > - qdisc_restart gets slightly modified while looking simple, without
> >   any checks for batching vs regular code (in fact only two lines
> >   have changed - 1. instead of dev_dequeue_skb, a new batch-aware
> >   function is called; and 2. an extra call to hard_start_xmit_batch).
> > - No change in __qdisc_run other than a new argument (from DM's idea).
> > - Applies to latest net-2.6.23 compared to 2.6.22-rc4 code.
>
> All the above are cosmetic differences. To me the highest priority is
> making sure that batching is useful and what the limitations are. At
> some point, when all looks good - I don't mind adding an ethtool
> interface to turn off/on batching, merging with the new qdisc restart
> path instead of having a parallel path, soliciting feedback on naming,
> where to allocate structs etc etc. All that is low prio if batching
> across a variety of hardware and applications doesn't prove useful. At
> the moment, I am unsure there's consistency to justify pushing
> batching in.

Batching need not be useful for every piece of hardware. If there is hardware that can usefully exploit batching (IPoIB is clearly a good candidate, as both the TX and the TX completion paths can handle multiple-skb processing; I haven't looked at other drivers to see if any of them can do something similar), then IMHO it makes sense to enable batching for that hardware. It is up to the other drivers to determine whether converting to the batching API makes sense or not. And as indicated, the total size increase for adding the kernel support is also insignificant - 0.03%, or 1164 bytes (using the 'size' command).

Having said that, below are the main architectural differences we have, which is what we really need to discuss and see what proves useful:

> > - Batching algo/processing is different (eg. if qdisc_restart() finds
> >   one skb in the batch list, it will try to batch more (up to a
> >   limit) instead of sending that out and batching the rest in the
> >   next call.
>
> This sounds a little more aggressive but maybe useful. I have
> experimented with setting upper bound limits (current patches have a
> pktgen interface to set the max to send) and have concluded that it is
> unneeded. Probing by letting the driver tell you what space is
> available has proven to be the best approach. I have been meaning to
> remove the code in pktgen which allows these limits.

I don't quite agree with that approach; eg, if the blist is empty and the driver says there is space for one packet, you will add one packet, the driver sends it out, and the device is stopped (with potentially a lot of skbs on dev->q). Then no packets are added till the queue is enabled, at which time a flood of skbs will be processed, increasing latency and holding the lock for a single longer duration. My approach will mitigate holding the lock for longer times and instead send skbs to the device as long as we are within the limits. In fact in my rev2 patch (being sent today or tomorrow after handling Patrick's and Stephen's comments), I am even removing the driver-specific xmit_slots, as I find it adds bloat and requires more cycles than calculating the value each time xmit is done (of course in your approach it is required since the stack uses it).

> > - Jamal's code has a separate hw prep handler called from the stack,
> >   and results are accessed in the driver during xmit later.
>
> I have explained the reasoning for this a few times. A recent response
> to Michael Chan is here:
> http://marc.info/?l=linux-netdev&m=118346921316657&w=2

Since E1000 doesn't seem to use the TX lock on RX (at least I couldn't find it), I feel having prep will not help, as no other CPU can execute the queue/xmit code anyway (E1000 is also a LLTX driver). Other drivers that hold the tx lock could get an improvement however.

> And here's a response to you that I haven't heard back on:
> http://marc.info/?l=linux-netdev&m=118355539503924&w=2

That is because it answered my query :) It is what I was expecting, but thanks for the explanation.

> My tests so far indicate this interface is useful. It doesn't apply
> well

I wonder if you tried enabling/disabling 'prep' on E1000 to see how the performance is affected. If it helps, I guess you could send me a patch to add that and I can also test it to see what the effect is. I didn't add it since IPoIB wouldn't be able to exploit it (unless someone is kind enough to show me how to).

> So if I was to sum up this (it would be a useful discussion to have on
> these), the real difference is:
> a) you have an extra check on refilling the skb list when you find
>    that it has a single skb. I tagged this as being potentially
>    useful.

It is very useful since extra processing is not required for the one-skb case -
[ofa-general] Re: [PATCH 11/12 -Rev2] IPoIB xmit API addition
Hi Michael,

Michael S. Tsirkin [EMAIL PROTECTED] wrote on 07/22/2007 03:11:36 PM:

> > +	/*
> > +	 * Handle skbs completion from tx_tail to wr_id. It is possible to
> > +	 * handle WC's from earlier post_sends (possibly multiple) in this
> > +	 * iteration as we move from tx_tail to wr_id, since if the last
> > +	 * WR (which is the one which had a completion request) failed to
> > +	 * be sent for any of those earlier request(s), no completion
> > +	 * notification is generated for successful WR's of those earlier
> > +	 * request(s).
> > +	 */
>
> AFAIK a signalled WR will always generate a completion. What am I
> missing?

Yes, a signalled WR will generate a completion. I am trying to catch the case where, say, I send 64 skbs and set signalling for only the last skb while the others are set to NO signalling. Now if the driver found the last WR was bad for some reason, it will synchronously fail the send for that WR (which happens to be the only one that is signalled). So after skbs 1 to 63 are finished, there will be no completion called. That was my understanding of how this works, and I coded it that way so that the next post will clean up the previous one's completion.

> > +	/*
> > +	 * Better error handling can be done here, like free
> > +	 * all untried skbs if err == -ENOMEM. However at this
> > +	 * time, we re-try all the skbs, all of which will
> > +	 * likely fail anyway (unless the device finished sending
> > +	 * some out in the meantime). This is not a regression
> > +	 * since the earlier code is not doing this either.
> > +	 */
>
> Are you retrying posting skbs? Why is this a good idea? AFAIK, earlier
> code did not retry posting WRs at all.

Not exactly. If I send 64 skbs to the device and the provider returned a bad WR at skb #50, then I will have to try skbs #51-64 again, since the provider has not attempted to send those out as it bails out at the first failure. The provider of course has already sent out skbs #1-49 before returning failure at skb #50. So it is not strictly a retry, just xmit of the next skbs, which is what the current code also does. I tested this part out by simulating errors in mthca_post_send and verified that the next iteration clears up the remaining skbs.

> The comment seems to imply that post send fails as a result of SQ
> overflow -

Correct.

> do you see SQ overflow errors in your testing?

No.

> AFAIK, IPoIB should never overflow the SQ.

Correct. It should never happen unless IPoIB has a bug :) I guess the comment should be removed?

Thanks,
- KK
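[Editorial note] The completion-coalescing scheme discussed above can be modeled in a few lines: because only the last WR of a post is signalled, one completion for wr_id implicitly retires every ring slot from tx_tail up to and including wr_id. A toy sketch; TX_RING_SIZE and the function name are illustrative, not the driver's actual code:

```c
/* Reap all tx-ring slots covered by a single completion.  *tx_tail is a
 * monotonically increasing counter whose low bits index the ring; wr_id
 * identifies the ring slot of the (signalled) last WR. */
#define TX_RING_SIZE 64U

static unsigned int reap_tx_ring(unsigned int *tx_tail, unsigned int wr_id)
{
	unsigned int freed = 0;

	for (;;) {
		unsigned int slot = *tx_tail % TX_RING_SIZE;

		++freed;                  /* skb in 'slot' would be freed here */
		++*tx_tail;
		if (slot == wr_id % TX_RING_SIZE)
			break;
	}
	return freed;
}
```

With 64 unsignalled sends followed by one signalled send, a single call retires all 64 slots at once, which is exactly why a synchronously failed last WR would otherwise leave the earlier, already-transmitted WRs unreaped until the next post's completion.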
[ofa-general] Re: [PATCH 06/12 -Rev2] rtnetlink changes.
Hi Patrick,

Patrick McHardy [EMAIL PROTECTED] wrote on 07/22/2007 10:40:37 PM:

> Krishna Kumar wrote:
> > diff -ruNp org/include/linux/if_link.h rev2/include/linux/if_link.h
> > --- org/include/linux/if_link.h 2007-07-20 16:33:35.0 +0530
> > +++ rev2/include/linux/if_link.h 2007-07-20 16:35:08.0 +0530
> > @@ -78,6 +78,8 @@ enum
> >  	IFLA_LINKMODE,
> >  	IFLA_LINKINFO,
> >  #define IFLA_LINKINFO IFLA_LINKINFO
> > +	IFLA_TXBTHSKB,	/* Driver support for Batch'd skbs */
> > +#define IFLA_TXBTHSKB IFLA_TXBTHSKB
>
> Ughh what a name :) I prefer pronounceable names since they are much
> easier to remember and don't need comments explaining what they mean.
> But I actually think offering just an ethtool interface would be
> better, at least for now.

Great, I will remove /sys and rtnetlink and keep the ethtool i/f.

Thanks,
- KK
[ofa-general] Re: [PATCH 02/12 -Rev2] Changes to netdevice.h
Hi Patrick,

Patrick McHardy [EMAIL PROTECTED] wrote on 07/22/2007 10:36:51 PM:

> Krishna Kumar wrote:
> > @@ -472,6 +474,9 @@ struct net_device
> >  	void *priv;	/* pointer to private data */
> >  	int (*hard_start_xmit) (struct sk_buff *skb, struct net_device *dev);
> > +	int (*hard_start_xmit_batch) (struct net_device *dev);
>
> Is this function really needed? Can't you just call hard_start_xmit
> with a NULL skb and have the driver use dev->blist?

Probably not. I will see how to do it this way and get back to you.

> > /* These may be needed for future network-power-down code. */
> >  	unsigned long trans_start;	/* Time (in jiffies) of last Tx */
> > @@ -582,6 +587,8 @@ struct net_device
> >  #define NETDEV_ALIGN		32
> >  #define NETDEV_ALIGN_CONST	(NETDEV_ALIGN - 1)
> > +#define BATCHING_ON(dev)	((dev->features & NETIF_F_BATCH_ON) != 0)
> >
> >  static inline void *netdev_priv(const struct net_device *dev)
> >  {
> >  	return dev->priv;
> > @@ -832,6 +839,8 @@ extern int dev_set_mac_address(struct n
> >  		struct sockaddr *);
> >  extern int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev);
> > +extern int dev_add_skb_to_blist(struct sk_buff *skb, struct net_device *dev);
>
> Again, function signatures should be introduced in the same patch that
> contains the function. Splitting by file doesn't make sense.

Right. I did it for some but missed this. Sorry, will redo.

Thanks,
- KK
[ofa-general] Re: [PATCH 00/10] Implement batching skb API
Hi Jamal,

J Hadi Salim [EMAIL PROTECTED] wrote on 07/22/2007 06:21:09 PM:

> My concern is there is no consistency in results. I see improvements
> on something which you say you don't. You see improvement in something
> that Evgeniy doesn't, etc.

Hmmm? Evgeniy has not even tested my code to find some regression :) And you may possibly not find much improvement in E1000 when you run iperf (which is what I do) compared to pktgen. I can re-run and confirm this since my last E1000 run was quite some time back. My point is that batching not being viable for E1000 (or tg3) need not be the sole criterion for inclusion. If IPoIB or other drivers can take advantage of it and get better results, then batching can be considered. Maybe E1000 too can get improvements if someone with more expertise tries to add this API (not judging your driver-writing capabilities - just stating that driver writers will know more knobs to exploit a complex device like E1000).

> > Since E1000 doesn't seem to use the TX lock on RX (at least I
> > couldn't find it), I feel having prep will not help, as no other CPU
> > can execute the queue/xmit code anyway (E1000 is also a LLTX driver).
>
> My experiments show it is useful (in a very visible way using pktgen)
> for e1000 to have the prep() interface.

I meant: have you compared results of batching with prep on vs prep off, and what is the difference in BW?

> > Other drivers that hold the tx lock could get an improvement however.
>
> So you do see the value then with non-LLTX drivers, right? ;->

No. I see value only in non-LLTX drivers which also get the same TX lock in the RX path. If different locks are taken by TX/RX, then since you are holding queue_lock before calling 'prep', this excludes other TX from running at the same time. In that case, moving up the taking of tx_lock to do the 'prep' will not cause any degradation (since no other tx can run anyway, while rx can run as it takes a different lock).

> The value is also there in LLTX drivers, even if just in formatting an
> skb ready for transmit. If this is not clear I could do a much longer
> writeup on my thought evolution towards adding prep().

In LLTX drivers, the driver does the 'prep' without holding the tx_lock in any case, so there should be no improvement. Could you send the write-up, since I really don't see the value in prep unless the driver is non-LLTX *and* TX/RX hold the same TX lock. I think that is the sole criterion, right?

> > If it helps, I guess you could send me a patch to add that and I can
> > also test it to see what the effect is. I didn't add it since IPoIB
> > wouldn't be able to exploit it (unless someone is kind enough to show
> > me how to).
>
> Such core code should not just be focused on IPoIB.

There is *nothing* IPoIB-specific or IPoIB-focused in my code. I said adding prep doesn't work for IPoIB, and so it is pointless to add bloat to the code until some code can actually take advantage of this feature (I am sure you will agree). Which is why I also mentioned to please send me a patch if you find it useful for any driver, rather than rejecting this idea.

> > I think the code I have is ready and stable,
>
> I am not sure how to interpret that - are you saying all-is-good and we
> should just push your code in?

I am only too well aware that Dave will not accept any code (having experienced this with Mobile IPv6 a long time back, when he said to move most of it to userspace, and he was absolutely correct :). What I meant to say is that there isn't much point in saying that your code is not ready, or you are using an old code base, or it has multiple restart functions, or is not tested enough, etc, and then saying let's re-do/rethink the whole implementation, when my code is already working and giving good results. Unless you have some design issues with it, or the code is written badly, is not maintainable, is not Linux style compliant, is buggy, or will not handle some case/workload - those types of issues. OTOH, if you find some cases that are better handled with: 1. prep handler, 2. xmit_win (which I don't have now), then please send me patches and I will also test them out and incorporate them.

> It sounds disingenuous but I may have misread you.

(lacking in frankness, candor, or sincerity; falsely or hypocritically ingenuous; insincere)

Sorry, no response to personal comments and have a flame-war :)

Thanks,
- KK
[ofa-general] Re: [PATCH 04/10] net-sysfs.c changes.
Stephen Hemminger [EMAIL PROTECTED] wrote on 07/20/2007 09:52:03 PM:

> Patrick McHardy [EMAIL PROTECTED] wrote:
> > Krishna Kumar2 wrote:
> > > Patrick McHardy [EMAIL PROTECTED] wrote on 07/20/2007 03:37:20 PM:
> > > > rtnetlink support seems more important than sysfs to me.
> > >
> > > Thanks, I will add that as a patch. The reason to add to sysfs is
> > > that it is easier to change for a user (and similar to
> > > tx_queue_len).
>
> But since batching is so similar to TSO, it really should be part of
> the flags and controlled by ethtool like other offload flags.

So should I add all three interfaces (or which ones):
1. /sys (like for tx_queue_len)
2. netlink
3. ethtool
Or are only 2 & 3 enough?

Thanks,
- KK
[ofa-general] Re: [PATCH 03/10] dev.c changes.
Hi Sridhar,

Sridhar Samudrala [EMAIL PROTECTED] wrote on 07/20/2007 11:14:19 PM:

> > @@ -1566,7 +1605,7 @@ gso:
> >  	/* reset queue_mapping to zero */
> >  	skb->queue_mapping = 0;
> >  	rc = q->enqueue(skb, q);
> > -	qdisc_run(dev);
> > +	qdisc_run(dev, NULL);
>
> OK. So you are passing a NULL blist here. However, I am not sure why
> batching is not used in this situation.

Actually it could be used, but in most cases there will be only one skb. If I pass the blist here, the result (for the batching case) will be to put one single skb into the blist and call the new xmit API. That wastes cycles, as we take an skb out from the queue (as in the regular code), then add it to the blist (different in the new code), and then the driver has to remove this skb from the blist (different in the new code). I could try batching, but then require that there is more than 1 skb before adding to the blist (or that the blist already has skbs, in which case adding even one skb makes sense). Also, it will have a slight impact for regular drivers, where for each xmit one extra dereference of dev->skb_blist (which is always NULL) is made, which was another reason to always pass NULL. I will check what the results are by passing the blist here too and make the above change. I will run tests for that (as well as the netperf RR test as asked by Evgeniy).

Thanks,
- KK
[ofa-general] Re: [PATCH 05/10] sch_generic.c changes.
Hi Patrick,

Patrick McHardy [EMAIL PROTECTED] wrote on 07/20/2007 11:46:36 PM:

> Krishna Kumar wrote:
> > +static inline int get_skb(struct net_device *dev, struct Qdisc *q,
> > +			  struct sk_buff_head *blist,
> > +			  struct sk_buff **skbp)
> > +{
> > +	if (likely(!blist) || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1)) {
> > +		return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
> > +	} else {
> > +		int max = dev->tx_queue_len - skb_queue_len(blist);
>
> I'm assuming the driver will simply leave excess packets in the blist
> for the next run.

Yes, and the next run will be scheduled even if no more xmits are called, due to qdisc_restart()'s call to the driver returning:
  BUSY: driver failed to send all; net_tx_action will handle this later (the case you mentioned).
  OK and qlen is > 0: return 1 and __qdisc_run() will retry (where the blist length will become zero as the driver processed EVERYTHING on the blist).

> The check for tx_queue_len is wrong though; it's only a default which
> can be overridden, and some qdiscs don't care for it at all.

I think it should not matter whether qdiscs use this or not, or even if it is modified (unless it is made zero, in which case this breaks). The intention behind this check is to make sure that not more than tx_queue_len skbs are in all queues put together (q->qdisc + dev->skb_blist); otherwise the blist can become too large and break the idea of tx_queue_len. Is that a good justification?

> > -void __qdisc_run(struct net_device *dev)
> > +void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
>
> And the patches should really be restructured so this change is in the
> same patch changing the header and the caller, for example.

Ah, OK.

Thanks,
- KK
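[Editorial note] The get_skb() decision quoted above can be distilled into a standalone sketch, assuming the condition reads `!blist || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1)` (the comparison operators were garbled in the archived hunk). The single-skb fast path is taken unless a batch list exists and more than one packet is in play; the batch path then caps the transfer so the qdisc plus blist together never exceed tx_queue_len:

```c
/* Returns 0 for the fast path (dequeue exactly one skb), otherwise the
 * maximum number of skbs that may be moved onto the batch list. */
static int batch_quota(int have_blist, int blist_len, int qdisc_len,
		       int tx_queue_len)
{
	if (!have_blist || (blist_len == 0 && qdisc_len <= 1))
		return 0;                    /* single-skb fast path */
	return tx_queue_len - blist_len;     /* room left on the blist */
}
```

This mirrors the justification KK gives for the tx_queue_len check: the cap shrinks as packets accumulate on the blist, bounding the combined queue depth.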
[ofa-general] Re: [PATCH 02/10] Networking include file changes.
Hi Sridhar, Sridhar Samudrala [EMAIL PROTECTED] wrote on 07/20/2007 10:55:05 PM: diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h --- org/include/net/pkt_sched.h 2007-07-20 07:49:28.0 +0530 +++ new/include/net/pkt_sched.h 2007-07-20 08:30:22.0 +0530 @@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge struct rtattr *tab); extern void qdisc_put_rtab(struct qdisc_rate_table *tab); -extern void __qdisc_run(struct net_device *dev); +extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist); Why do we need this additional 'blist' argument? Is this different from dev->skb_blist? It is the same, but I want to call it mostly with NULL and rarely with the batch list pointer (so it is related to your other question). My original code didn't have this and was trying batching in all cases. But in most xmits (probably almost all), there will be only one packet in the queue to send, and batching will never happen. When there is lock contention or the queue is stopped, then the next iteration will find > 1 packets. But I will still try no batching for the lock-failure case, as there will probably be only 2 packets (one from the previous time and 1 from this time, or 3 if there were two failures, etc.), and try batching only when the queue was stopped from net_tx_action (this was based on Dave Miller's idea). Thanks, - KK
[ofa-general] Results Scripts for : [PATCH 00/10] Implement batching skb API
Attached file contains scripts for running tests and parsing results: (See attached file: scripts.tar) The result of a 10-run (average) TCP iperf (and 1 netperf run for UDP) is given below. Thanks, - KK
-
Test configuration: Single cross-over cable for MTHCA cards (MT23108) on two PPC64 systems; both systems are 8-CPU P5 1.5 GHz with 8GB memory.
A. TCP results for a 10-run average are as follows (using iperf; could not run netperf in parallel as it is not synchronized). First number: Orig BW in KB/s. Second number: New BW in KB/s. Third number: Percentage change. IPoIB was configured with 512 sendq size; the default configuration (128) gave positives for most test cases but more negatives for 512 and 4K buffer sizes.
Buffer Size 32:
TCP Threads:1  :   3126    3169    1.4
TCP Threads:4  :   9739   10889   11.8
TCP Threads:16 :  35383   47218   33.4
TCP Threads:64 :  85147   84196   -1.1
Average: 9.05%
TCP No Delay Threads:1  :   1990    2976   49.5
TCP No Delay Threads:4  :   8137    8770    7.7
TCP No Delay Threads:16 :  31714   37308   17.63
TCP No Delay Threads:64 :  72830   81892   12.44
Average: 14.19%
Buffer Size 128:
TCP Threads:1  :  12674   13339    5.2
TCP Threads:4  :  37889   40816    7.7
TCP Threads:16 : 141342  165935   17.3
TCP Threads:64 : 199813  196283   -1.7
Average: 6.29%
TCP No Delay Threads:1  :   7732   11272   45.7
TCP No Delay Threads:4  :  33348   35222    5.6
TCP No Delay Threads:16 : 120507  143960   19.5
TCP No Delay Threads:64 : 195459  193875   -0.8
Average: 7.64%
Buffer Size 512:
TCP Threads:1  :  42256   55735   31.9
TCP Threads:4  : 161237  161777    0.3
TCP Threads:16 : 227911  231781    1.7
TCP Threads:64 : 229779  223152   -2.9
Average: 1.70%
TCP No Delay Threads:1  :  30065   42500   41.3
TCP No Delay Threads:4  :  79076  125848   59.1
TCP No Delay Threads:16 : 225725  224155   -0.7
TCP No Delay Threads:64 : 231220  223664   -3.26
Average: 8.84%
Buffer Size 4096:
TCP Threads:1  : 119364  135445   13.5
TCP Threads:4  : 261301  256754   -1.7
TCP Threads:16 : 246889  247065    0.07
TCP Threads:64 : 237613  234185   -1.4
Average: 0.95%
TCP No Delay Threads:1  : 102187  104087    1.9
TCP No Delay Threads:4  : 204139  243169   19.1
TCP No Delay Threads:16 : 245529  242519   -1.2
TCP No Delay Threads:64 : 236826  233382   -1.4
Average: 4.37%
-
B. Using netperf to run 1-process UDP (1 run, measured with 128 sendq size; will be re-doing with 512 sendq and a 10-run average):
--
         Org                 New                Perc
      BW     Service      BW     Service     BW     Service
--
     6.40    1277.64     6.50    1272.41    1.56     -0.40
    24.80     663.01    25.80     318.13    4.03    -52.01
   101.80      81.02   101.90      80.63    0.09     -0.48
   395.70      20.77   395.90      20.74    0.05     -0.14
  1172.90       7.00  1156.80       7.10   -1.37      1.42
--
scripts.tar Description: Binary data
[ofa-general] Re: [PATCH 00/10] Implement batching skb API
Stephen Hemminger [EMAIL PROTECTED] wrote on 07/20/2007 12:48:48 PM: You may see worse performance with batching in the real world when running over WANs. Like TSO, batching will generate back-to-back packet trains that are subject to multi-packet synchronized loss. The problem is that intermediate router queues are often close to full, and when a long string of packets arrives back to back, only the first ones will get in; the rest get dropped. Normal sends have at least minimal pacing, so they are less likely to get synchronized drops. Hi Stephen, OK. The difference that I could see is that in the existing code, the minimal pacing could also lead to (possibly slightly lower) loss, since sends are quick iterations at the IP layer, while with batching, sends are iterative at the driver layer. Is it an issue? Any suggestions? Thanks, - KK
[ofa-general] Re: [PATCH 00/10] Implement batching skb API
Stephen Hemminger [EMAIL PROTECTED] wrote on 07/20/2007 12:48:48 PM: You may see worse performance with batching in the real world when running over WANs. Like TSO, batching will generate back-to-back packet trains that are subject to multi-packet synchronized loss. The problem is that intermediate router queues are often close to full, and when a long string of packets arrives back to back, only the first ones will get in; the rest get dropped. Normal sends have at least minimal pacing, so they are less likely to get synchronized drops. Also forgot to mention in the previous mail: if performance is seen to be dipping, batching can be disabled on WANs by: echo 0 > /sys/class/net/<dev>/tx_batch_skbs and batching can still be used on local/site networks in that case.
[ofa-general] Re: [PATCH 05/10] sch_generic.c changes.
Patrick McHardy [EMAIL PROTECTED] wrote on 07/20/2007 03:41:01 PM: Krishna Kumar wrote: diff -ruNp org/net/sched/sch_generic.c new/net/sched/sch_generic.c --- org/net/sched/sch_generic.c 2007-07-20 07:49:28.0 +0530 +++ new/net/sched/sch_generic.c 2007-07-20 08:30:22.0 +0530 @@ -9,6 +9,11 @@ * Authors: Alexey Kuznetsov, [EMAIL PROTECTED] * Jamal Hadi Salim, [EMAIL PROTECTED] 990601 * - Ingress support + * + * New functionality: + * Krishna Kumar, [EMAIL PROTECTED], July 2007 + * - Support for sending multiple skbs to devices that support + * new api - dev->hard_start_xmit_batch() No new changelogs in source code please, git keeps track of that. Ah, didn't know this, thanks for letting me know. -static inline int qdisc_restart(struct net_device *dev) +static inline int qdisc_restart(struct net_device *dev, + struct sk_buff_head *blist) { struct Qdisc *q = dev->qdisc; struct sk_buff *skb; - unsigned lockless; + unsigned getlock; /* whether we need to get lock or not */ Unrelated rename, please get rid of this to reduce the noise. OK, I guess I should have sent that change earlier :) The reason to change the name was to avoid (double-negative) checks like "if (!lockless)" in favor of "if (getlock)". I will remove these changes. thanks, - KK
[ofa-general] Re: [PATCH 03/10] dev.c changes.
Hi Patrick, Thanks for your comments. Patrick McHardy [EMAIL PROTECTED] wrote on 07/20/2007 03:34:30 PM: The queue length can be changed through multiple interfaces; if that really is important you need to catch these cases too. I have a TODO comment in net-sysfs.c which is to catch this case. + } else { + dev->skb_blist = kmalloc(sizeof *dev->skb_blist, + GFP_KERNEL); Why not simply put the head in struct net_device? It seems to me that this could also be used for gso_skb. Without going into GSO, it is wasting some 32 bytes on i386 since most drivers don't export this API. Queue purging should be done in dev_deactivate. I originally had it in dev_deactivate, but when I did an ifdown eth0 / ifup eth0, the system panicked. The first solution I thought of was to initialize the skb_blist in dev_change_flags() rather than in register_netdev(), but then I felt that a series of ifup/ifdown would unnecessarily check/malloc/free/initialize stuff, and so thought of putting it in unregister_netdev (where it is balanced with register_netdev). Is there any reason to move this? Thanks, - KK
[ofa-general] Re: [PATCH 04/10] net-sysfs.c changes.
Patrick McHardy [EMAIL PROTECTED] wrote on 07/20/2007 03:37:20 PM: Krishna Kumar wrote: Support to turn on/off batching from /sys. rtnetlink support seems more important than sysfs to me. Thanks, I will add that as a patch. The reason to add it to sysfs is that it is easier for a user to change (and similar to tx_queue_len). - KK
[ofa-general] Re: [PATCH 03/10] dev.c changes.
Hi Patrick, Patrick McHardy [EMAIL PROTECTED] wrote on 07/20/2007 04:50:37 PM: I have a TODO comment in net-sysfs.c which is to catch this case. I noticed that. Still wondering why it is important at all though. I saw another mail of yours on the marc list on this same topic (which still hasn't come to me in the mail), so I will answer both: Is there any downside in using batching with smaller queue sizes? I think there is, but as yet I don't have any data (and 16 is probably higher than required) to show it. If the queue size is very small (like 4), the extra processing to maintain this list may take more cycles than the performance gain from sending out a few skbs, especially since most xmits will send out 1 skb and skb batching takes place less often (when the tx lock fails or the queue gets full). OTOH, there might be a gain in sending out even 2 skbs; the problem is in doing the extra processing before xmit and not at the time of xmit. Does this sound OK? If so, I will add the code to implement the TODO for tx_queue_len checking too. Without going into GSO, it is wasting some 32 bytes on i386 since most drivers don't export this API. 32 bytes? I count 16 - 4 for the pointer, so it's 12 bytes of waste. If you'd use it for gso_skb it would come down to 8 bytes. struct net_device is a pig already, and there are better ways to reduce this than starting to allocate single members with a few bytes IMO. Sorry, I wanted to say 12 bytes on a 32-bit system but mixed it up and said 32 bytes. So I guess static allocation is better then, and it will also help performance as a memory access is not required (offsetof should work). Yes, packets can be holding references to various stuff and these should be released on device down. As I said above I don't really like the allocation, but even if you want to keep it, just do the purging in dev_deactivate and keep the freeing in unregister_netdev (actually I guess it should be free_netdev to handle register_netdevice errors). Right, that makes it clean to do (and avoids stale packets on down). I will make both these changes now. Thanks for these suggestions, - KK
[ofa-general] Re: [PATCH 03/10] dev.c changes.
Patrick McHardy [EMAIL PROTECTED] wrote on 07/20/2007: I can't really argue about the numbers, but it seems to me that only devices which *usually* have a sufficient queue length will support this, and anyone setting the queue length of a gbit device to 16 is begging for trouble anyway. So it doesn't really seem worth it to bloat the code for handling an insane configuration as long as it doesn't break. Ah, I get your point now. So if the driver sets BATCHING and the user then sets queue_len to (say) 4, then poor results are expected (and the kernel doesn't need to try to fix it). The same goes for a driver setting BATCHING when its queue is small in the first place, which no driver writer should do anyway. I think it makes the code a lot easier too. Will update. thanks, - KK
[ofa-general] Re: [PATCH 03/10] dev.c changes.
Patrick McHardy [EMAIL PROTECTED] wrote on 07/20/2007 04:50:37 PM: 32 bytes? I count 16 - 4 for the pointer, so it's 12 bytes of waste. If you'd use it for gso_skb it would come down to 8 bytes. struct net_device is a pig already, and there are better ways to reduce this than starting to allocate single members with a few bytes IMO. Currently, this allocated pointer is an indication to let kernel users (qdisc_restart, setting/resetting tx_batch_skbs) know whether batching is enabled or disabled. Removing the pointer and making it static means those users cannot figure out this information. Adding another field to netdev may be a bad idea, so I am thinking of overloading dev->features to add a new flag (other than NETIF_F_BATCH_SKBS, since that is a driver capabilities flag) which can be set/cleared based on the NETIF_F_BATCH_SKBS bit. Does this approach sound OK? Thanks, - KK
[ofa-general] Re: [PATCH 00/10] Implement batching skb API
Hi Evgeniy, Evgeniy Polyakov [EMAIL PROTECTED] wrote on 07/20/2007 06:24:23 PM: After fine-tuning the qdisc and other changes, I modified IPoIB to use this API, and now get good gains. Summary for TCP No Delay: 1 process improves for all cases, from 1.4% to 49.5%; 4 processes have almost identical improvements, from -1.7% to 59.1%; the 16-process case also improves, in the range of -1.2% to 33.4%; while 64 processes don't have much improvement (-3.3% to 12.4%). UDP was tested with 1-process netperf, with a small increase in BW but a big improvement in Service Demand. netperf latency tests show a small drop in transaction rate (results in a separate attachment). What about round-robin TCP time and latency tests? In theory such a batching mode should not change those timings, but practice can show new aspects. I will review the code later this week (likely tomorrow) and will get back if there are issues. I had run the RR test quite some time back and don't have the result at this time, other than remembering it was almost the same as the original. As I am running some tests on those systems at this time, I can send the RR results tomorrow. Thanks, - KK