This set of patches implements the batching xmit capability (now a
device capability flag rather than a new API), and adds support for
batching in IPoIB. Also included is a sample patch for E1000 (ported
thanks to Jamal's E1000 changes from an earlier kernel). I will use
this patch for testing E1000 TSO vs batching after the weekend.

List of changes from previous revision:
----------------------------------------
1. [Dave/Patrick] Remove new xmit API altogether (and add a capabilities
   flag in dev->features). Modify documentation to remove API, etc.
2. [Evgeniy] Remove bogus checks for <0, and use spin_lock_bh.
3. [Jamal] Ported Jamal's E1000 driver changes for using batching xmit.
4. [KK] IPoIB: Remove multiple xmit handlers and convert to use one.
5. [KK] Fix out-of-order sending of skbs bug resulting in
   re-transmissions by a fix in IPoIB [see XXX].
6. [KK] Do not force device to use batching as default, instead let user
   enable batching if required. This is useful in case users are not
   aware that batching is taking place.
7. [KK] IPoIB: Removed overkill - poll handler can be called on one CPU,
   so there is no need to take a new lock against parallel WC's.

Extras that I can do later:
---------------------------
1. [Patrick] Use skb_blist statically in netdevice. This could also be
   used to integrate GSO and batching.
2. [Evgeniy] Useful to splice lists dev_add_skb_to_blist (and this can
   be done for regular xmit's of GSO skbs too for #1 above).

Patches are described as:
	Mail 0/10:  This mail
	Mail 1/10:  HOWTO documentation
	Mail 2/10:  Introduce skb_blist, NETIF_F_BATCH_SKBS, use single
		    API for batching/no-batching, etc.
	Mail 3/10:  Modify qdisc_run() to support batching
	Mail 4/10:  Add ethtool support to enable/disable batching
	Mail 5/10:  IPoIB: Header file changes to use batching
	Mail 6/10:  IPoIB: CM & Multicast changes
	Mail 7/10:  IPoIB: Verbs changes to use batching
	Mail 8/10:  IPoIB: Internal post and work completion handler
	Mail 9/10:  IPoIB: Implement the new batching capability
	Mail 10/10: E1000: Implement the new batching capability

Issues:
--------
I am getting a huge amount of retransmissions for both TCP and TCP No
Delay cases for IPoIB (which explains the slight degradation for some
test cases mentioned in previous mail).
After a full test run, there were 18500 retransmissions for every 1 in
regular code. But there is a 20.7% overall improvement in BW even with
this huge number of retransmissions (which implies batching could
improve results even more if this problem is fixed). Results of
experiments are:
	a. With batching set to a maximum of 2 skbs, I get almost the same
	   number of retransmissions (which implies the receiver is
	   probably not dropping skbs). ifconfig/netstat on the receiver
	   gives no clue (drops/errors, etc).
	b. Making the IPoIB xmit create single work requests for each skb
	   on blist reduces retransmissions to the same level as in
	   regular code.
	c. A similar retransmission increase is not seen for E1000.

Please review and provide feedback; and consider for inclusion.

Thanks,

- KK

[XXX] Dave had suggested to use batching only in the net_tx_action case.
When I implemented that in earlier revisions, there were lots of TCP
retransmissions (about 18,000 to every 1 in regular code). I found the
reason for part of that problem to be:
	- skbs get queued up in dev->qdisc (when the tx lock was not got
	  or the queue was blocked);
	- when net_tx_action is called later, it passes the batch list as
	  an argument to qdisc_run, and this results in skbs being moved
	  to the batch list;
	- then batching xmit also fails due to tx lock failure;
	- the next many regular xmits of a single skb go through the fast
	  path (pass a NULL batch list to qdisc_run) and send those skbs
	  out to the device while the previous skbs are cooling their
	  heels in the batch list.

The first fix was to not pass NULL/batch-list to qdisc_run() but to
always check whether skbs are present in the batch list when trying to
xmit. This reduced retransmissions by a third (from 18,000 to around
12,000), but led to another problem while testing: iperf transmits
almost zero data for a higher number of parallel flows like 64 or more
(and when I run iperf for a 2 min run, it takes about 5-6 mins, and
reports that it ran 0 secs and that the amount of data transferred is a
few MB's).
I don't know why this happens with this being the only change (any
ideas are very much appreciated).

The second fix, which resolved this, was to revert to Dave's suggestion
to use batching only in the net_tx_action case, and to modify the driver
to check whether skbs are present in the batch list and send them out
first before sending the current skb. I still see a huge number of
retransmissions for IPoIB (but not for E1000), though it has reduced to
12,000 from the earlier 18,000 number.
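
To make the reordering in [XXX] concrete, here is a minimal userspace
sketch in C. It is not the kernel or IPoIB code: struct skb_list and
every function below are hypothetical stand-ins, and skbs are reduced to
sequence numbers. It assumes three skbs are parked on the batch list
after a failed batching xmit, two more skbs then arrive on the regular
path, and the batch list is flushed afterwards.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SKBS 16

/* Hypothetical stand-in for a list of skbs, each reduced to a sequence number. */
struct skb_list {
	int seq[MAX_SKBS];
	size_t len;
};

static void skbl_add_tail(struct skb_list *l, int seq)
{
	assert(l->len < MAX_SKBS);
	l->seq[l->len++] = seq;
}

/* Move every entry of src to the tail of dst, preserving order. */
static void skbl_splice_tail(struct skb_list *dst, struct skb_list *src)
{
	for (size_t i = 0; i < src->len; i++)
		skbl_add_tail(dst, src->seq[i]);
	src->len = 0;
}

/* Problem case: the current skb goes out while older skbs stay parked. */
static void xmit_ignoring_blist(struct skb_list *wire, struct skb_list *blist,
				int seq)
{
	(void)blist;
	skbl_add_tail(wire, seq);
}

/* Second fix: send anything on the batch list first, then the current skb. */
static void xmit_draining_blist(struct skb_list *wire, struct skb_list *blist,
				int seq)
{
	skbl_splice_tail(wire, blist);
	skbl_add_tail(wire, seq);
}

/* Returns 1 if the sequence numbers are strictly increasing, else 0. */
static int in_order(const struct skb_list *l)
{
	for (size_t i = 1; i < l->len; i++)
		if (l->seq[i] <= l->seq[i - 1])
			return 0;
	return 1;
}

/*
 * skbs 1..3 are stuck on the batch list; skbs 4 and 5 then arrive on the
 * regular path; finally a later batching xmit flushes whatever is left.
 * Returns 1 if the wire saw the skbs in order.
 */
static int run_scenario(int drain_first)
{
	struct skb_list wire = { .len = 0 }, blist = { .len = 0 };

	for (int s = 1; s <= 3; s++)
		skbl_add_tail(&blist, s);
	for (int s = 4; s <= 5; s++) {
		if (drain_first)
			xmit_draining_blist(&wire, &blist, s);
		else
			xmit_ignoring_blist(&wire, &blist, s);
	}
	skbl_splice_tail(&wire, &blist);
	return in_order(&wire);
}
```

Calling run_scenario(0) models the fast path overtaking the batch list
(the wire sees 4, 5, 1, 2, 3), while run_scenario(1) models the driver
draining the batch list first (the wire sees 1 through 5 in order). The
sketch only covers ordering; it says nothing about the remaining IPoIB
retransmissions.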