[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
On Tue, 2007-09-10 at 08:39 +0530, Krishna Kumar2 wrote: Driver might ask for 10 and we send 10, but LLTX driver might fail to get lock and return TX_LOCKED. I haven't seen your code in greater detail, but don't you requeue in that case too? For other drivers that are non-batching and LLTX, it is possible - at the moment my patch whines that the driver is buggy. I will fix this up so it checks for NETIF_F_BTX. Thanks for pointing out the above use case. cheers, jamal ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: [PATCHES] TX batching
On Tue, 2007-09-10 at 13:44 +0530, Krishna Kumar2 wrote: My feeling is that since the approaches are very different, My concern is the approaches are different only for short periods of time. For example, I do requeueing, have xmit_win, have ->end_xmit, do batching from the core, etc; if you see value in any of these concepts, they will appear in your patches, and this goes on in a loop. Perhaps what we need is a referee, and to use our energies on something more positive. it would be a good idea to test the two for performance. Which i don't mind, as long as it comes with an analysis. If all you post is "here's what netperf showed", it is not useful at all. There are also a lot of affecting variables. For example, is the receiver a bottleneck? To make it worse, I could demonstrate to you that if i slowed down the driver and allowed more packets to queue up on the qdisc, batching will do well. In the past my feeling is you glossed over such details, and i am a sucker for things like that - hence the conflict. Do you mind me doing that? Of course others and/or you are more than welcome to do the same. I had sent a note to you yesterday about this, please let me know either way. I responded to you - but it may have been lost in the noise; here's a copy: http://marc.info/?l=linux-netdev&m=119185137124008&w=2 cheers, jamal
[ofa-general] [PATCHES] TX batching rev2.5
Please provide feedback on the code and/or architecture. The patches are now updated to work with the latest rebased net-2.6.24 from a few hours ago. I am in travel mode so i won't have time to do more testing for the next few days - i do consider this to be stable at this point based on what i have been testing (famous last words).

Patch 1: Introduces batching interface
Patch 2: Core uses batching interface
Patch 3: get rid of dev->gso_skb

What has changed since i posted last:
1) There was some cruft left over from the prep_frame feature that i forgot to remove last time - it is now gone.
2) In the shower this AM, i realized that it is plausible that a batch of packets sent to the driver may all be dropped because they are badly formatted. Current drivers all return NETDEV_TX_OK in all such cases, which would lead to dev->hard_end_xmit() being invoked unnecessarily. I already had a NETDEV_TX_DROPPED within the batching drivers, so i made that global and have the batching drivers return it if they drop packets. The core calls dev->hard_end_xmit() only when at least one packet makes it through.

Things i am gonna say that nobody will see (wink): Dave, please let me know if this meets your desire to allow devices which are SG and able to compute CSUM to benefit, just in case i misunderstood. Herbert, if you can look at at least patch 3 i will appreciate it (since it kills dev->gso_skb which you introduced).

UPCOMING PATCHES
---
As before: more patches to follow later if i get some feedback - i didn't want to overload people by dumping too many patches. Most of the patches mentioned below are ready to go; some need re-testing and others need a little porting from an earlier kernel:
- tg3 driver
- tun driver
- pktgen
- netiron driver
- e1000e driver (non-LLTX)
- ethtool interface
- There is at least one other driver promised to me

There's also a driver howto that i will post later today.
PERFORMANCE TESTING

The system under test is still a 2x dual-core opteron with a couple of tg3s. A test tool generates udp traffic of different sizes for up to 60 seconds per run, or a total of 30M packets. I have 4 threads, each running on a specific CPU, which keep all the CPUs as busy as they can sending packets targeted at a directly connected box's udp discard port. All 4 CPUs target a single tg3 to send. The receiving box has a tc rule which counts and drops all incoming udp packets to the discard port - this allows me to make sure that the receiver is not the bottleneck in the testing. Packet sizes sent are {8B, 32B, 64B, 128B, 256B, 512B, 1024B}. Each packet size run is repeated 10 times to ensure that there are no transients. The average of all 10 runs is then computed and collected. I also plan to run forwarding and TCP tests in the future when the dust settles.

cheers, jamal
[ofa-general] [PATCH 1/3] [NET_BATCH] Introduce batching interface Rev2.5
This patch introduces the netdevice interface for batching.

cheers, jamal

[NET_BATCH] Introduce batching interface

This patch introduces the netdevice interface for batching.

BACKGROUND

A driver's dev->hard_start_xmit() has 4 typical parts:
a) packet formatting (example: vlan, mss, descriptor counting, etc.)
b) chip-specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit, tell the DMA engine to chew on, tx completion interrupts, set last tx time, etc.

[For code cleanliness/readability sake, regardless of this work, one should break dev->hard_start_xmit() into those 4 functions anyways.]

INTRODUCING API
---
With the API introduced in this patch, a driver which has all 4 parts and needs to support batching is advised to split its dev->hard_start_xmit() in the following manner:
1) Remove #d from dev->hard_start_xmit() and put it in the dev->hard_end_xmit() method.
2) #b and #c can stay in ->hard_start_xmit() (or whichever way you want to do this).
3) #a is deferred to future work to reduce confusion (since it stands on its own).

Note: there are drivers which may not need to support either of the two approaches (example: the tun driver i patched), so the methods are optional.

The xmit_win variable is set by the driver to tell the core how much space it has to take on new skbs. It is introduced to ensure that when we pass the driver a list of packets it will swallow all of them - which is useful because we don't requeue to the qdisc (and avoids burning unnecessary cpu cycles or introducing any strange re-ordering). The driver tells us, when it invokes netif_wake_queue(), how much space it has for descriptors by setting this variable. Refer to the driver howto for more details.

THEORY OF OPERATION
---
1. Core dequeues from qdiscs up to dev->xmit_win packets. Fragmented and GSO packets are accounted for as well.
2. Core grabs TX_LOCK.
3. Core loops over all skbs and invokes driver dev->hard_start_xmit().
4. Core invokes driver dev->hard_end_xmit().

ACKNOWLEDGEMENT AND SOME HISTORY

There's a lot of history and reasoning of why batching in a document i am writing which i may submit as a patch. Thomas Graf (who probably doesn't know this) gave me the impetus to start looking at this back in 2004 when he invited me to the linux conference he was organizing. Parts of what i presented at SUCON in 2004 talk about batching. Herbert Xu forced me to take a second look around 2.6.18 - refer to my netconf 2006 presentation. Krishna Kumar provided me with more motivation in May 2007 when he posted on netdev and engaged me. Sridhar Samudrala, Krishna Kumar, Matt Carlson, Michael Chan, Jeremy Ethridge, Evgeniy Polyakov, Sivakumar Subramani, David Miller, Patrick McHardy, Jeff Garzik and Bill Fink have contributed in one or more of {bug fixes, enhancements, testing, lively discussion}. The Broadcom and Neterion folks have been outstanding in their help.

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
 commit 98d39ea7922fa2719a80eecd02cae359f3d7
 tree 63822bf3040ea41846399c589c912c2be654f008
 parent 7b4cd20628fe5c4e145c383fcd8d954d38f7be61
 author Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:06:28 -0400
 committer Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:06:28 -0400

 include/linux/netdevice.h |    9 +-
 net/core/dev.c            |   67 ++---
 net/sched/sch_generic.c   |    4 +--
 3 files changed, 73 insertions(+), 7 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 91cd3f3..b0e71c9 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -86,6 +86,7 @@ struct wireless_dev;
 /* Driver transmit return codes */
 #define NETDEV_TX_OK 0		/* driver took care of packet */
 #define NETDEV_TX_BUSY 1	/* driver tx path was busy */
+#define NETDEV_TX_DROPPED 2	/* driver tx path dropped packet */
 #define NETDEV_TX_LOCKED -1	/* driver tx lock was already taken */

 /*
@@ -467,6 +468,7 @@ struct net_device
 #define NETIF_F_NETNS_LOCAL 8192	/* Does not change network namespaces */
 #define NETIF_F_MULTI_QUEUE 16384	/* Has multiple TX/RX queues */
 #define NETIF_F_LRO 32768		/* large receive offload */
+#define NETIF_F_BTX 65536		/* Capable of batch tx */

 /* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT 16
@@ -595,6 +597,9 @@ struct net_device
 	void *priv;	/* pointer to private data */
 	int (*hard_start_xmit) (struct sk_buff *skb, struct net_device *dev);
+	void (*hard_end_xmit) (struct net_device *dev);
+	int xmit_win;
+
 	/* These may be needed for future network-power-down code. */
 	unsigned long trans_start;	/* Time (in jiffies) of last Tx */
@@ -609,6 +614,7 @@ struct net_device
 	/* delayed register/unregister */
 	struct list_head todo_list;
+	struct sk_buff_head blist;
 	/* device index hash chain */
 	struct hlist_node index_hlist;
@@ -1043,7 +1049,8
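The THEORY OF OPERATION above (dequeue up to xmit_win packets, send them all under one TX lock hold, then invoke ->hard_end_xmit() once) can be modeled in a few lines of userspace C. This is a sketch only: `batch_dev`, `fake_hard_start_xmit()` and friends are hypothetical stand-ins, not kernel APIs.

```c
/* Userspace sketch of the batching theory of operation.
 * All names here are hypothetical stand-ins for the kernel's
 * net_device/qdisc machinery, not real kernel APIs. */
#include <assert.h>

#define NETDEV_TX_OK      0
#define NETDEV_TX_BUSY    1
#define NETDEV_TX_DROPPED 2

struct batch_dev {
    int xmit_win;       /* space the driver advertised */
    int ring_used;      /* packets sitting on the fake DMA ring */
    int end_xmit_calls; /* how often ->hard_end_xmit() ran */
};

/* Stand-in for dev->hard_start_xmit(): enqueue one packet on the ring. */
static int fake_hard_start_xmit(struct batch_dev *dev)
{
    dev->ring_used++;
    return NETDEV_TX_OK;
}

/* Stand-in for dev->hard_end_xmit(): kick the (imaginary) DMA engine. */
static void fake_hard_end_xmit(struct batch_dev *dev)
{
    dev->end_xmit_calls++;
}

/* Core loop: take up to xmit_win packets from the queue, hand them all
 * to the driver, then do the expensive I/O step once at the end.
 * Returns the number of packets handed to the driver. */
static int batch_xmit(struct batch_dev *dev, int queued)
{
    int sent = 0, ok = 0;
    int take = queued < dev->xmit_win ? queued : dev->xmit_win;

    while (sent < take) {
        if (fake_hard_start_xmit(dev) == NETDEV_TX_OK)
            ok++;
        sent++;
    }
    /* per the NETDEV_TX_DROPPED discussion: only run hard_end_xmit()
     * if at least one packet made it through */
    if (ok)
        fake_hard_end_xmit(dev);
    return sent;
}
```

Because xmit_win bounds the dequeue, the driver is never handed more than it said it could swallow, which is what makes requeueing unnecessary in the common case.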
[ofa-general] [PATCH 2/3][NET_BATCH] Rev2.5 net core use batching
This patch adds the usage of batching within the core.

cheers, jamal

[NET_BATCH] net core use batching

This patch adds the usage of batching within the core. Performance results demonstrating improvement are provided separately. I have #if-0ed some of the old functions so the patch is more readable. A future patch will remove all #if-0ed content. Patrick McHardy eyeballed a bug that would have caused re-ordering in the case of a requeue.

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
 commit c73d8ee8cce61a98f55fbfb2cafe813a7eca472c
 tree 8b9155fe15baa4c2e7adb69585c7aa275a6bc896
 parent 98d39ea7922fa2719a80eecd02cae359f3d7
 author Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:13:30 -0400
 committer Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:13:30 -0400

 net/sched/sch_generic.c | 104 ++-
 1 files changed, 94 insertions(+), 10 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 424c08b..d98c680 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -56,6 +56,7 @@ static inline int qdisc_qlen(struct Qdisc *q)
 	return q->q.qlen;
 }

+#if 0
 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
 				  struct Qdisc *q)
 {
@@ -110,6 +111,85 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
 	return ret;
 }
+#endif
+
+static inline int handle_dev_cpu_collision(struct net_device *dev)
+{
+	if (unlikely(dev->xmit_lock_owner == smp_processor_id())) {
+		if (net_ratelimit())
+			printk(KERN_WARNING
+			       "Dead loop on netdevice %s, fix it urgently!\n",
+			       dev->name);
+		return 1;
+	}
+	__get_cpu_var(netdev_rx_stat).cpu_collision++;
+	return 0;
+}
+
+static inline int
+dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev,
+		 struct Qdisc *q)
+{
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue_tail(skbs)) != NULL)
+		q->ops->requeue(skb, q);
+
+	netif_schedule(dev);
+	return 0;
+}
+
+static inline int
+xmit_islocked(struct sk_buff_head *skbs, struct net_device *dev,
+	      struct Qdisc *q)
+{
+	int ret = handle_dev_cpu_collision(dev);
+
+	if (ret) {
+		if (!skb_queue_empty(skbs))
+			skb_queue_purge(skbs);
+		return qdisc_qlen(q);
+	}
+
+	return dev_requeue_skbs(skbs, dev, q);
+}
+
+static int xmit_count_skbs(struct sk_buff *skb)
+{
+	int count = 0;
+	for (; skb; skb = skb->next) {
+		count += skb_shinfo(skb)->nr_frags;
+		count += 1;
+	}
+	return count;
+}
+
+static int xmit_get_pkts(struct net_device *dev,
+			 struct Qdisc *q,
+			 struct sk_buff_head *pktlist)
+{
+	struct sk_buff *skb;
+	int count = dev->xmit_win;
+
+	if (count > 0 && dev->gso_skb) {
+		skb = dev->gso_skb;
+		dev->gso_skb = NULL;
+		count -= xmit_count_skbs(skb);
+		__skb_queue_tail(pktlist, skb);
+	}
+
+	while (count > 0) {
+		skb = q->dequeue(q);
+		if (!skb)
+			break;
+
+		count -= xmit_count_skbs(skb);
+		__skb_queue_tail(pktlist, skb);
+	}
+
+	return skb_queue_len(pktlist);
+}

 /*
  * NOTE: Called under dev->queue_lock with locally disabled BH.
@@ -133,19 +213,20 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
 static inline int qdisc_restart(struct net_device *dev)
 {
 	struct Qdisc *q = dev->qdisc;
-	struct sk_buff *skb;
-	int ret, xcnt = 0;
+	int ret = 0;

-	/* Dequeue packet */
-	if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
-		return 0;
+	/* Dequeue packets */
+	ret = xmit_get_pkts(dev, q, &dev->blist);
+	if (!ret)
+		return 0;

-	/* And release queue */
+	/* We got em packets */
 	spin_unlock(&dev->queue_lock);

+	/* bye packets */
 	HARD_TX_LOCK(dev, smp_processor_id());
-	ret = dev_hard_start_xmit(skb, dev, xcnt);
+	ret = dev_batch_xmit(dev);
 	HARD_TX_UNLOCK(dev);

 	spin_lock(&dev->queue_lock);
@@ -158,8 +239,8 @@ static inline int qdisc_restart(struct net_device *dev)
 		break;

 	case NETDEV_TX_LOCKED:
-		/* Driver try lock failed */
-		ret = handle_dev_cpu_collision(skb, dev, q);
+		/* Driver lock failed */
+		ret = xmit_islocked(&dev->blist, dev, q);
 		break;

 	default:
@@ -168,7 +249,7 @@ static inline int qdisc_restart(struct net_device *dev)
 			printk(KERN_WARNING "BUG %s code %d qlen %d\n",
 			       dev->name, ret, q->q.qlen);

-		ret = dev_requeue_skb(skb, dev, q);
+		ret = dev_requeue_skbs(&dev->blist, dev, q);
 		break;
 	}
@@ -564,6 +645,9 @@ void dev_deactivate(struct net_device *dev)
 	skb = dev->gso_skb;
 	dev->gso_skb = NULL;
+	if (!skb_queue_empty(&dev->blist))
+		skb_queue_purge(&dev->blist);
+	dev->xmit_win = 1;
 	spin_unlock_bh(&dev->queue_lock);

 	kfree_skb(skb);
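One detail of the patch worth calling out is xmit_count_skbs(): a packet is charged one descriptor for the skb itself plus one per paged fragment, so fragmented and GSO traffic consumes xmit_win faster than linear skbs. A small userspace sketch of that accounting, where `struct fake_skb` stands in for `struct sk_buff`:

```c
/* Userspace sketch of the descriptor accounting in xmit_count_skbs():
 * each skb costs one descriptor plus one per paged fragment, summed
 * over the whole skb->next chain (e.g. GSO segments). */
#include <assert.h>
#include <stddef.h>

struct fake_skb {
    int nr_frags;          /* skb_shinfo(skb)->nr_frags */
    struct fake_skb *next; /* skb->next (segment chain)  */
};

static int xmit_count_skbs(struct fake_skb *skb)
{
    int count = 0;

    for (; skb; skb = skb->next) {
        count += skb->nr_frags; /* one descriptor per fragment   */
        count += 1;             /* plus one for the skb itself   */
    }
    return count;
}
```

This is why the core can safely subtract the result from xmit_win before handing packets to the driver: the window is spent in descriptor units, not packet units.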
[ofa-general] [PATCH 3/3][NET_BATCH] Rev2.5 kill dev-gso_skb
This patch removes dev->gso_skb as it is no longer necessary with the batching code.

cheers, jamal

[NET_BATCH] kill dev->gso_skb

The batching code does what gso used to batch at the drivers. There is no more need for gso_skb. If for whatever reason the requeueing is a bad idea, we are going to leave packets in dev->blist (and still not need dev->gso_skb).

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
 commit fac8a4147548f314d4edb74634e78e5b06e0e135
 tree 72114acb327bc7e3eb219275df6b3aab7459795c
 parent c73d8ee8cce61a98f55fbfb2cafe813a7eca472c
 author Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:22:43 -0400
 committer Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:22:43 -0400

 include/linux/netdevice.h |  3 ---
 net/sched/sch_generic.c   | 12 
 2 files changed, 0 insertions(+), 15 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b0e71c9..7592a56 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -578,9 +578,6 @@ struct net_device
 	struct list_head qdisc_list;
 	unsigned long tx_queue_len;	/* Max frames per queue allowed */

-	/* Partially transmitted GSO packet. */
-	struct sk_buff *gso_skb;
-
 	/* ingress path synchronizer */
 	spinlock_t ingress_lock;
 	struct Qdisc *qdisc_ingress;

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index d98c680..36b6972 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -172,13 +172,6 @@ static int xmit_get_pkts(struct net_device *dev,
 	struct sk_buff *skb;
 	int count = dev->xmit_win;

-	if (count > 0 && dev->gso_skb) {
-		skb = dev->gso_skb;
-		dev->gso_skb = NULL;
-		count -= xmit_count_skbs(skb);
-		__skb_queue_tail(pktlist, skb);
-	}
-
 	while (count > 0) {
 		skb = q->dequeue(q);
 		if (!skb)
@@ -635,7 +628,6 @@ void dev_activate(struct net_device *dev)
 void dev_deactivate(struct net_device *dev)
 {
 	struct Qdisc *qdisc;
-	struct sk_buff *skb;

 	spin_lock_bh(&dev->queue_lock);
 	qdisc = dev->qdisc;
@@ -643,15 +635,11 @@ void dev_deactivate(struct net_device *dev)
 	qdisc_reset(qdisc);

-	skb = dev->gso_skb;
-	dev->gso_skb = NULL;
 	if (!skb_queue_empty(&dev->blist))
 		skb_queue_purge(&dev->blist);
 	dev->xmit_win = 1;
 	spin_unlock_bh(&dev->queue_lock);

-	kfree_skb(skb);
-
 	dev_watchdog_down(dev);

 	/* Wait for outstanding dev_queue_xmit calls. */
[ofa-general] Re: [PATCHES] TX batching rev2.5
On Tue, 2007-09-10 at 18:07 -0400, jamal wrote: Please provide feedback on the code and/or architecture. They are now updated to work with the latest rebased net-2.6.24 from a few hours ago. I should have added that i have tested this with just the batching changes and it is within the performance realm of the changes from yesterday. If anyone wants exact numbers, i can send them. cheers, jamal
[ofa-general] [DOC][NET_BATCH]Rev2.5 Driver Howto
I updated this doc to match the latest patch. cheers, jamal

Here's the beginning of a howto for driver authors. The intended audience for this howto is people already familiar with netdevices.

1.0 Netdevice Prerequisites
--
For hardware-based netdevices, you must have at least hardware that is capable of doing DMA with many descriptors; i.e., having hardware with a queue length of 3 (as in some fscked ethernet hardware) is not very useful in this case.

2.0 What is new in the driver API
---
There is one new method and one new variable introduced that the driver author needs to be aware of. These are:
1) dev->hard_end_xmit()
2) dev->xmit_win

2.1 Using Core driver changes
-
To provide context, let's look at a typical driver abstraction for dev->hard_start_xmit(). It has 4 parts:
a) packet formatting (example: vlan, mss, descriptor counting, etc.)
b) chip-specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit, tell the DMA engine to chew on, tx completion interrupts, etc.

[For code cleanliness/readability sake, regardless of this work, one should break dev->hard_start_xmit() into those 4 functional blocks anyways.]

A driver which has all 4 parts and needs to support batching is advised to split its dev->hard_start_xmit() in the following manner:
1) use its dev->hard_end_xmit() method to achieve #d
2) use dev->xmit_win to tell the core how much space you have.
#b and #c can stay in ->hard_start_xmit() (or whichever way you want to do this). Section 3 shows more details on the suggested usage.

2.1.1 Theory of operation
--
1. Core dequeues from qdiscs up to dev->xmit_win packets. Fragmented and GSO packets are accounted for as well.
2. Core grabs the device's TX_LOCK.
3. Core loops over all skbs and invokes driver dev->hard_start_xmit().
4. Core invokes driver dev->hard_end_xmit() if packets were xmitted.

2.1.1.1 The slippery LLTX
-
Since these types of drivers are being phased out and they require extra code, they will not be supported anymore. So as of Oct 2007 the code that supports them has been removed.

2.1.1.2 xmit_win

The dev->xmit_win variable is set by the driver to tell us how much space it has in its rings/queues. This detail is then used to figure out how many packets are retrieved from the qdisc queues (in order to send to the driver). dev->xmit_win is introduced to ensure that when we pass the driver a list of packets it will swallow all of them -- which is useful because we don't requeue to the qdisc (and avoids burning unnecessary CPU cycles or introducing any strange re-ordering). Essentially the driver signals us how much space it has for descriptors by setting this variable.

2.1.1.2.1 Setting xmit_win
--
This variable should be set during xmit path shutdown (netif_stop), wakeup (netif_wake) and ->hard_end_xmit(). In the first case the value is set to 1, and in the other two it is set to whatever the driver deems to be available space on the ring.

3.0 Driver Essentials
-
The typical driver tx state machine is:

-1-
+ Core sends packets
+ Driver puts packet onto hardware queue
+ if hardware queue is full, netif_stop_queue(dev)
-2-
+ core stops sending because of netif_stop_queue(dev)
.. time passes ...
-3-
+ driver has transmitted packets, opens up tx path by invoking netif_wake_queue(dev)
-1-
+ Cycle repeats and core sends more packets (step 1).

3.1 Driver prerequisite
--
This is _a very important_ requirement in making batching useful. The prerequisite for batching changes is that the driver should provide a low threshold to open up the tx path. Drivers such as tg3 and e1000 already do this. Before you invoke netif_wake_queue(dev) you check if there is a threshold of space reached to insert new packets. Here's an example of how I added it to the tun driver. Observe the setting of dev->xmit_win.

---
#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
..
..
	u32 t = skb_queue_len(&tun->readq);
	if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
		tun->dev->xmit_win = tun->dev->tx_queue_len;
		netif_wake_queue(tun->dev);
	}
---

Here's how the batching e1000 driver does it:
--
	if (unlikely(cleaned && netif_carrier_ok(netdev) &&
		     E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD)) {
		if (netif_queue_stopped(netdev)) {
			int rspace = E1000_DESC_UNUSED(tx_ring) - (MAX_SKB_FRAGS + 2);
			netdev->xmit_win = rspace;
			netif_wake_queue(netdev);
		}
	}
---

The tg3 code (with no batching changes) looks like:
-
	if (netif_queue_stopped(tp->dev) &&
	    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))
		netif_wake_queue(tp->dev
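The wake-threshold pattern shown in the tun/e1000/tg3 snippets above can be condensed into a userspace sketch. All names and constants here (`fake_ring`, `WAKE_THRESH`, `MAX_FRAGS_SLOP`) are illustrative assumptions, not values from any real driver:

```c
/* Userspace sketch of the section 3.1 wake-threshold rule: the driver
 * reopens the tx path only once ring space crosses a low threshold,
 * and advertises the usable space via xmit_win. Illustrative only. */
#include <assert.h>

#define RING_SIZE      64
#define WAKE_THRESH    8  /* low threshold to open up the tx path */
#define MAX_FRAGS_SLOP 3  /* headroom for a worst-case fragmented skb */

struct fake_ring {
    int used;     /* descriptors in flight */
    int stopped;  /* netif_queue_stopped() analogue */
    int xmit_win; /* space advertised to the core */
};

/* Tx-completion path: free descriptors, maybe wake the queue. */
static void tx_complete(struct fake_ring *r, int cleaned)
{
    int unused;

    r->used -= cleaned;
    unused = RING_SIZE - r->used;
    if (r->stopped && unused >= WAKE_THRESH) {
        /* like the batching e1000 above: advertise usable space,
         * minus headroom for one worst-case packet */
        r->xmit_win = unused - MAX_FRAGS_SLOP;
        r->stopped = 0; /* netif_wake_queue() analogue */
    }
}
```

The point of the threshold is that the driver does not ping-pong between stopped and awake one descriptor at a time; it wakes only when there is enough room for a useful batch.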
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
On Mon, 2007-08-10 at 10:33 +0530, Krishna Kumar2 wrote: As a side note: Any batching driver should _never_ have to requeue; if it does it is buggy. And the non-batching ones, if they ever requeue, will requeue a single packet, so not much reordering. On the contrary, batching LLTX drivers (if that is not ruled out) will very often requeue, resulting in heavy reordering. Fix looks good though. Two things: one, LLTX is deprecated (I think i saw a patch which says no more new drivers should do LLTX) and i plan, if nobody else does it, to kill LLTX in e1000 RSN. So for that reason i removed all code that existed to support LLTX. Two, there should _never_ have been any requeueing, even with LLTX, in the previous patches where i supported them; if there is, it is a bug. This is because we don't send more than what the driver asked for via xmit_win. So if it asked for more than it can handle, that is a bug. If its available space changes while we are sending to it, that too is a bug. cheers, jamal
[ofa-general] Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
On Sun, 2007-07-10 at 21:51 -0700, David Miller wrote: For these high performance 10Gbit cards it's a load balancing function, really, as all of the transmit queues go out to the same physical port, so you could: 1) load balance on CPU number, 2) load balance on flow, 3) load balance on destination MAC, etc. The brain-block i am having is the parallelization aspect of it. Whatever scheme it is, it needs to ensure the scheduler works as expected. For example, if it was a strict prio scheduler i would expect that whatever goes out is always high priority first, and never ever allow a low prio packet out at any time there's something high prio needing to go out. If i have the two priorities running on two cpus, then i can't guarantee that effect. IOW, i see the scheduler/qdisc level as not being split across parallel cpus. Do i make any sense? The rest of my understanding hinges on the above, so let me stop here. cheers, jamal
[ofa-general] Re: [PATCH 1/3] [NET_BATCH] Introduce batching interface
On Mon, 2007-08-10 at 15:29 +0530, Krishna Kumar2 wrote: Hi Jamal, If you don't mind, I am trying to run your approach vs mine to get some results for comparison. Please provide an analysis when you get the results. IOW, explain why one vs the other gets different results. For starters, I am having issues with iperf when using your infrastructure code with my IPoIB driver - about 100MB is sent and then everything stops for some reason. I haven't tested with iperf in a while. Can you post the netstat output on both sides when the driver stops? It does sound like a driver issue to me. The changes in the IPoIB driver that I made to support batching are to set BTX, set xmit_win, and dynamically reduce xmit_win on every xmit and increase xmit_win on every xmit completion. From the driver howto: --- This variable should be set during xmit path shutdown (netif_stop), wakeup (netif_wake) and ->hard_end_xmit(). In the first case the value is set to 1, and in the other two it is set to whatever the driver deems to be available space on the ring. --- Is there anything else that is required from the driver? Your driver needs to also support wake thresholding. I will post the driver howto later today. cheers, jamal
[ofa-general] Re: [PATCHES] TX batching
On Mon, 2007-08-10 at 16:51 +0400, Evgeniy Polyakov wrote: it looks like you and Krishna use the same requeueing methods - get one from the qdisc, queue it into blist, get the next from the qdisc, queue it, eventually start transmit, where you dequeue one-by-one and send (or prepare and commit). This is not the 100% optimal approach, but if you proved it does not hurt usual network processing, it is ok. There are probably other bottlenecks that hide the need to optimize further. The number of comments has dropped to very small - that's a sign, but I'm a bit lost - did you and Krishna create competing approaches, or can they co-exist together? In the former case I doubt you can push until all problematic places are resolved; in the latter case, this is probably ready. Thanks. I would like to make one more cleanup and get rid of the temporary pkt list in qdisc_restart; now that i have deferred the skb pre-format interface it is unnecessary. I have a day off today, so i will make the changes, re-run tests and post again. I don't see anything from Krishna's approach that i can take and reuse. This may be because my old approaches have evolved along the same path. There is a long list, but as a sample: i used to do a lot more work while holding the queue lock, which i have now moved post queue lock; i don't have any special interfaces/tricks just for batching; i provide hints to the core of how much the driver can take; etc. I have offered Krishna co-authorship if he makes the IPoIB driver work on my patches; that offer still stands if he chooses to take it. cheers, jamal
[ofa-general] Re: parallel networking (was Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock)
On Mon, 2007-08-10 at 10:22 -0400, Jeff Garzik wrote: Any chance the NIC hardware could provide that guarantee? If you can get the scheduling/dequeueing to run on one CPU (as we do today) it should work; alternatively you can totally bypass the qdisc subsystem and go direct to the hardware for devices that are capable, and that would work but would require huge changes. My fear is there are mini-scheduler pieces running on multiple cpus, which is what i understood as being described. 8139cp, for example, has two TX DMA rings, with hardcoded characteristics: one is a high prio q, and one a low prio q. The logic is pretty simple: empty the high prio q first (potentially starving low prio q, in worst case). Sounds like strict prio scheduling to me, which says if low prio starves, so be it. In terms of overall parallelization, both for TX as well as RX, my gut feeling is that we want to move towards an MSI-X, multi-core friendly model where packets are LIKELY to be sent and received by the same set of [cpus | cores | packages | nodes] that the [userland] processes dealing with the data. Does putting things in the same core help? But overall i agree with your views. There are already some primitive NUMA bits in skbuff allocation, but with modern MSI-X and RX/TX flow hashing we could do a whole lot more, along the lines of better CPU scheduling decisions, directing flows to clusters of cpus, and generally doing a better job of maximizing cache efficiency in a modern multi-thread environment. I think i see the receive side with a lot of clarity; i am still foggy on the transmit path, mostly because of the qos/scheduling issues. IMO the current model where each NIC's TX completion and RX processes are both locked to the same CPU is outmoded in a multi-core world with modern NICs. :) In fact, even with the status quo there's a case to be made for not binding to interrupts. In my recent experience with batching, due to the nature of my test app, if i let the interrupts float across multiple cpus i benefit. My app runs/binds a thread per CPU and so benefits from having more juice to send more packets per unit of time - something i wouldn't get if i was always running on one cpu. But when i do this i found that just because i have bound a thread to cpu3 doesn't mean that thread will always run on cpu3. If netif_wakeup happens on cpu1, the scheduler will put the thread on cpu1 if it is to be run. It made sense to do that; it just took me a while to digest. But I readily admit general ignorance about the kernel process scheduling stuff, so my only idea about a starting point was to see how far to go with the concept of skb affinity -- a mask in sk_buff that is a hint about which cpu(s) on which the NIC should attempt to send and receive packets. When going through bonding or netfilter, it is trivial to 'or' together affinity masks. All the various layers of net stack should attempt to honor the skb affinity, where feasible (requires interaction with CFS scheduler?). There would be cache benefits if you can free the packet on the same cpu it was allocated; so the idea of skb affinity is useful in the minimal in that sense, if you can pull it off. Assuming hardware is capable, even if you just tagged it on xmit to say which cpu it was sent out on, and made sure that's where it is freed, that would be a good start. Note: the majority of the packet processing overhead is _still_ the memory subsystem latency; in my tests with batched pktgen, improving the xmit subsystem meant the overhead of allocing and freeing the packets went up to something like 80%. So something along the lines of parallelizing based on a split of alloc/free of skbs on more cpus than where xmit/receive run would, IMO, see more performance improvements. Or maybe skb affinity is a dumb idea. I wanted to get people thinking on the bigger picture. Parallelization starts at the user process.
cheers, jamal ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [DOC][NET_BATCH] Driver howto
This is an updated driver howto for batching that works with patches from yesterday and the revised ones i am going to post. cheers, jamal

Here's the beginning of a howto for driver authors. The intended audience for this howto is people already familiar with netdevices.

1.0 Netdevice Prerequisites
--
For hardware-based netdevices, you must have at least hardware that is capable of doing DMA with many descriptors; i.e., having hardware with a queue length of 3 (as in some fscked ethernet hardware) is not very useful in this case.

2.0 What is new in the driver API
---
There is 1 new method and one new variable introduced that the driver author needs to be aware of. These are:
1) dev->hard_end_xmit()
2) dev->xmit_win

2.1 Using Core driver changes
-
To provide context, let's look at a typical driver abstraction for dev->hard_start_xmit(). It has 4 parts:
a) packet formatting (example: vlan, mss, descriptor counting, etc.)
b) chip-specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit, tell the DMA engine to chew on, tx completion interrupts, etc.
[For code cleanliness/readability sake, regardless of this work, one should break dev->hard_start_xmit() into those 4 functional blocks anyways].
A driver which has all 4 parts and needs to support batching is advised to split its dev->hard_start_xmit() in the following manner:
1) use its dev->hard_end_xmit() method to achieve #d
2) use dev->xmit_win to tell the core how much space you have.
#b and #c can stay in ->hard_start_xmit() (or whichever way you want to do this). Section 3 shows more details on the suggested usage.

2.1.1 Theory of operation
--
1. Core dequeues from qdiscs up to dev->xmit_win packets. Fragmented and GSO packets are accounted for as well.
2. Core grabs device's TX_LOCK
3. Core loop for all skbs: invokes driver dev->hard_start_xmit()
4. 
Core invokes driver dev->hard_end_xmit() if packets were xmitted

2.1.1.1 The slippery LLTX
-
Since these types of drivers are being phased out and they require extra code, they will not be supported anymore. So as of Oct 2007 the code that supports them has been removed.

2.1.1.2 xmit_win
-
The dev->xmit_win variable is set by the driver to tell us how much space it has in its rings/queues. This detail is then used to figure out how many packets are retrieved from the qdisc queues (in order to send to the driver). dev->xmit_win is introduced to ensure that when we pass the driver a list of packets it will swallow all of them -- which is useful because we don't requeue to the qdisc (and avoids burning unnecessary CPU cycles or introducing any strange re-ordering). Essentially the driver signals us how much space it has for descriptors by setting this variable.

2.1.1.2.1 Setting xmit_win
--
This variable should be set during xmit path shutdown (netif_stop), wakeup (netif_wake) and ->hard_end_xmit(). In the case of the first one the value is set to 1, and in the other two it is set to whatever the driver deems to be the available space on the ring.

3.0 Driver Essentials
-
The typical driver tx state machine is:
-1- +Core sends packets
    +-- Driver puts packet onto hardware queue
    +if hardware queue is full, netif_stop_queue(dev)
-2- +core stops sending because of netif_stop_queue(dev)
.. .. time passes ... ..
-3- +--- driver has transmitted packets, opens up tx path by invoking netif_wake_queue(dev)
-1- +Cycle repeats and core sends more packets (step 1).

3.1 Driver prerequisite
--
This is _a very important_ requirement in making batching useful. The prerequisite for batching changes is that the driver should provide a low threshold to open up the tx path. Drivers such as tg3 and e1000 already do this. Before you invoke netif_wake_queue(dev) you check if there is a threshold of space reached to insert new packets. Here's an example of how I added it to the tun driver. 
Observe the setting of dev->xmit_win.
---
+#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
..
..
	u32 t = skb_queue_len(&tun->readq);
	if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
		tun->dev->xmit_win = tun->dev->tx_queue_len;
		netif_wake_queue(tun->dev);
	}
---
Here's how the batching e1000 driver does it:
--
	if (unlikely(cleaned && netif_carrier_ok(netdev) &&
		     E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD)) {
		if (netif_queue_stopped(netdev)) {
			int rspace = E1000_DESC_UNUSED(tx_ring) - (MAX_SKB_FRAGS + 2);
			netdev->xmit_win = rspace;
			netif_wake_queue(netdev);
		}
---
in tg3 the code (with no batching changes) looks like:
-
	if (netif_queue_stopped(tp->dev) && (tg3_tx_avail(tp
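The wake-threshold rule the howto describes can be modeled in a few lines of plain C. This is a toy sketch, not kernel code: RING_SIZE, the struct fields, and the toy_* function names are all invented for illustration. It shows a driver stopping its queue when the ring fills (advertising xmit_win = 1 per section 2.1.1.2.1) and only reopening the tx path, with xmit_win set to the real free space, once completions have freed a low threshold of descriptors.

```c
#include <assert.h>

#define RING_SIZE  16 /* hypothetical DMA ring size */
#define NETDEV_LTT 4  /* low threshold to open up the tx path */

struct toy_dev {
	int inflight; /* descriptors currently on the hardware ring */
	int stopped;  /* models netif_queue_stopped() */
	int xmit_win; /* how many skbs the core may hand us */
};

/* Driver enqueue: stop the queue when the ring fills up. */
static int toy_start_xmit(struct toy_dev *dev)
{
	if (dev->inflight == RING_SIZE)
		return 1; /* would be NETDEV_TX_BUSY */
	dev->inflight++;
	dev->xmit_win = RING_SIZE - dev->inflight;
	if (dev->xmit_win == 0) {
		dev->stopped = 1;  /* netif_stop_queue() */
		dev->xmit_win = 1; /* set to 1 on shutdown, per the howto */
	}
	return 0;
}

/* TX-completion path: reopen only once a threshold of space exists,
 * and advertise the real free space via xmit_win. */
static void toy_tx_clean(struct toy_dev *dev, int completed)
{
	int space;

	dev->inflight -= completed;
	space = RING_SIZE - dev->inflight;
	if (dev->stopped && space >= NETDEV_LTT) {
		dev->xmit_win = space;
		dev->stopped = 0;  /* netif_wake_queue() */
	}
}
```

Filling the toy ring stops the queue; cleaning 2 descriptors (below NETDEV_LTT) keeps it stopped, and cleaning 2 more reopens it with xmit_win equal to the 4 free slots.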
[ofa-general] [PATCHES] TX batching rev2
Please provide feedback on the code and/or architecture. Last time i posted them i received little. They are now updated to work with the latest net-2.6.24 from a few hours ago. Patch 1: Introduces batching interface Patch 2: Core uses batching interface Patch 3: get rid of dev->gso_skb What has changed since i posted last: Killed the temporary packet list that is passed to qdisc restart. Dave, please let me know if this meets your desires to allow devices which are SG and able to compute CSUM benefit, just in case i misunderstood. Herbert, if you can look at at least patch 3 i will appreciate it (since it kills the dev->gso_skb that you introduced). UPCOMING PATCHES --- As before: More patches to follow later if i get some feedback - i didn't want to overload people by dumping too many patches. Most of these patches mentioned below are ready to go; some need some re-testing and others need a little porting from an earlier kernel: - tg3 driver - tun driver - pktgen - netiron driver - e1000 driver (LLTX) - e1000e driver (non-LLTX) - ethtool interface - There is at least one other driver promised to me There's also a driver-howto i wrote that was posted on netdev last week, as well as one that describes the architectural decisions made. PERFORMANCE TESTING The system under test hardware is still a 2xdual core opteron with a couple of tg3s. A test tool generates udp traffic of different sizes for up to 60 seconds per run or a total of 30M packets. I have 4 threads each running on a specific CPU which keep all the CPUs as busy as they can sending packets targeted at a directly connected box's udp discard port. All 4 CPUs target a single tg3 to send. The receiving box has a tc rule which counts and drops all incoming udp packets to the discard port - this allows me to make sure that the receiver is not the bottleneck in the testing. Packet sizes sent are {8B, 32B, 64B, 128B, 256B, 512B, 1024B}. Each packet size run is repeated 10 times to ensure that there are no transients. 
The average of all 10 runs is then computed and collected. I do plan also to run forwarding and TCP tests in the future when the dust settles. cheers, jamal
[ofa-general] [PATCH 1/3] [NET_BATCH] Introduce batching interface
This patch introduces the netdevice interface for batching. cheers, jamal [NET_BATCH] Introduce batching interface This patch introduces the netdevice interface for batching. BACKGROUND - A driver dev->hard_start_xmit() has 4 typical parts: a) packet formatting (example: vlan, mss, descriptor counting, etc.) b) chip-specific formatting c) enqueueing the packet on a DMA ring d) IO operations to complete packet transmit, tell the DMA engine to chew on, tx completion interrupts, set last tx time, etc. [For code cleanliness/readability sake, regardless of this work, one should break dev->hard_start_xmit() into those 4 functions anyways]. INTRODUCING API --- With the api introduced in this patch, a driver which has all 4 parts and needs to support batching is advised to split its dev->hard_start_xmit() in the following manner: 1) Remove #d from dev->hard_start_xmit() and put it in the dev->hard_end_xmit() method. 2) #b and #c can stay in ->hard_start_xmit() (or whichever way you want to do this) 3) #a is deferred to future work to reduce confusion (since it holds on its own). Note: There are drivers which may need not support either of the two approaches (example the tun driver i patched), so the methods are optional. The xmit_win variable is set by the driver to tell the core how much space it has to take on new skbs. It is introduced to ensure that when we pass the driver a list of packets it will swallow all of them - which is useful because we don't requeue to the qdisc (and avoids burning unnecessary cpu cycles or introducing any strange re-ordering). The driver tells us, when it invokes netif_wake_queue, how much space it has for descriptors by setting this variable. Refer to the driver howto for more details. THEORY OF OPERATION --- 1. Core dequeues from qdiscs up to dev->xmit_win packets. Fragmented and GSO packets are accounted for as well. 2. Core grabs TX_LOCK 3. Core loop for all skbs: invokes driver dev->hard_start_xmit() 4. 
Core invokes driver dev->hard_end_xmit() ACKNOWLEDGEMENT AND SOME HISTORY There's a lot of history and reasoning of why batching in a document i am writing, which i may submit as a patch. Thomas Graf (who doesn't know this probably) gave me the impetus to start looking at this back in 2004 when he invited me to the linux conference he was organizing. Parts of what i presented at SUCON in 2004 talk about batching. Herbert Xu forced me to take a second look around 2.6.18 - refer to my netconf 2006 presentation. Krishna Kumar provided me with more motivation in May 2007 when he posted on netdev and engaged me. Sridhar Samudrala, Krishna Kumar, Matt Carlson, Michael Chan, Jeremy Ethridge, Evgeniy Polyakov, Sivakumar Subramani, David Miller, Patrick McHardy, Jeff Garzik and Bill Fink have contributed in one or more of {bug fixes, enhancements, testing, lively discussion}. The Broadcom and neterion folks have been outstanding in their help. Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED] --- commit 8b9960cfbb98d48c0ac30b2c00d27c25217adb26 tree d8ab164757cee431f71a6011df4b09e0c382e2da parent cef811600354e9f41f059570fc877d19a1daed99 author Jamal Hadi Salim [EMAIL PROTECTED] Mon, 08 Oct 2007 09:03:42 -0400 committer Jamal Hadi Salim [EMAIL PROTECTED] Mon, 08 Oct 2007 09:03:42 -0400 include/linux/netdevice.h | 11 +++ net/core/dev.c | 40 2 files changed, 51 insertions(+), 0 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 91cd3f3..b31df5c 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -467,6 +467,7 @@ struct net_device #define NETIF_F_NETNS_LOCAL 8192 /* Does not change network namespaces */ #define NETIF_F_MULTI_QUEUE 16384 /* Has multiple TX/RX queues */ #define NETIF_F_LRO 32768 /* large receive offload */ +#define NETIF_F_BTX 65536 /* Capable of batch tx */ /* Segmentation offload features */ #define NETIF_F_GSO_SHIFT 16 @@ -595,6 +596,9 @@ struct net_device void *priv; /* pointer to private data */ int 
(*hard_start_xmit) (struct sk_buff *skb, struct net_device *dev); + void (*hard_end_xmit) (struct net_device *dev); + int xmit_win; + /* These may be needed for future network-power-down code. */ unsigned long trans_start; /* Time (in jiffies) of last Tx */ @@ -609,6 +613,7 @@ struct net_device /* delayed register/unregister */ struct list_head todo_list; + struct sk_buff_head blist; /* device index hash chain */ struct hlist_node index_hlist; @@ -1044,6 +1049,12 @@ extern int dev_set_mac_address(struct net_device *, struct sockaddr *); extern int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev); +extern int dev_batch_xmit(struct net_device *dev); +extern int prepare_gso_skb(struct sk_buff *skb, + struct net_device *dev, + struct sk_buff_head *skbs); +extern int
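The THEORY OF OPERATION above can be sketched as a small self-contained model. This is plain C for illustration only, not the kernel API: the toy_* names and simplified structs are invented stand-ins. It shows the shape of the core loop: dequeue up to xmit_win packets from the qdisc, hand each one to ->hard_start_xmit(), then make a single ->hard_end_xmit() call to kick the hardware.

```c
#include <assert.h>

struct toy_qdisc {
	int qlen; /* packets waiting in the qdisc */
};

struct toy_dev {
	int xmit_win;  /* space advertised by the driver */
	int started;   /* total packets handed to hard_start_xmit */
	int end_calls; /* times hard_end_xmit was invoked */
};

/* Models q->dequeue(q): returns 1 if a packet was dequeued. */
static int toy_dequeue(struct toy_qdisc *q)
{
	if (q->qlen == 0)
		return 0;
	q->qlen--;
	return 1;
}

/* Models one pass of qdisc_restart with batching. */
static void toy_qdisc_restart(struct toy_dev *dev, struct toy_qdisc *q)
{
	int sent = 0;

	/* 1. dequeue up to dev->xmit_win packets ...
	 * 3. ... invoking ->hard_start_xmit() for each */
	while (sent < dev->xmit_win && toy_dequeue(q))
		sent++;
	dev->started += sent;

	/* 4. one ->hard_end_xmit() per batch, if anything was sent */
	if (sent)
		dev->end_calls++;
}
```

With 10 packets queued and xmit_win of 4, three passes drain the queue in batches of 4, 4 and 2, with one end-of-batch call each; a fourth pass sends nothing and makes no end call.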
[ofa-general] [PATCH 2/3][NET_BATCH] net core use batching
This patch adds the usage of batching within the core. cheers, jamal [NET_BATCH] net core use batching This patch adds the usage of batching within the core. Performance results demonstrating improvement are provided separately. I have #if-0ed some of the old functions so the patch is more readable. A future patch will remove all #if-0ed content. Patrick McHardy eyeballed a bug that will cause re-ordering in case of a requeue. Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED] --- commit 63381156b35719a364d0f81fec487e6263fb5467 tree 615dfb5f8617a8a4edffbb965e81a75c60597319 parent 8b9960cfbb98d48c0ac30b2c00d27c25217adb26 author Jamal Hadi Salim [EMAIL PROTECTED] Mon, 08 Oct 2007 09:10:14 -0400 committer Jamal Hadi Salim [EMAIL PROTECTED] Mon, 08 Oct 2007 09:10:14 -0400 net/sched/sch_generic.c | 116 +++ 1 files changed, 106 insertions(+), 10 deletions(-) diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 95ae119..96bfdcb 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -56,6 +56,7 @@ static inline int qdisc_qlen(struct Qdisc *q) return q->q.qlen; } +#if 0 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev, struct Qdisc *q) { @@ -110,6 +111,97 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb, return ret; } +#endif + +static inline int handle_dev_cpu_collision(struct net_device *dev) +{ + if (unlikely(dev->xmit_lock_owner == smp_processor_id())) { + if (net_ratelimit()) + printk(KERN_WARNING "Dead loop on netdevice %s, fix it urgently!\n", dev->name); + return 1; + } + __get_cpu_var(netdev_rx_stat).cpu_collision++; + return 0; +} + +static inline int +dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev, + struct Qdisc *q) +{ + + struct sk_buff *skb; + + while ((skb = __skb_dequeue_tail(skbs)) != NULL) + q->ops->requeue(skb, q); + + netif_schedule(dev); + return 0; +} + +static inline int +xmit_islocked(struct sk_buff_head *skbs, struct net_device *dev, + struct Qdisc *q) +{ + int 
ret = handle_dev_cpu_collision(dev); + + if (ret) { + if (!skb_queue_empty(skbs)) + skb_queue_purge(skbs); + return qdisc_qlen(q); + } + + return dev_requeue_skbs(skbs, dev, q); +} + +static int xmit_count_skbs(struct sk_buff *skb) +{ + int count = 0; + for (; skb; skb = skb->next) { + count += skb_shinfo(skb)->nr_frags; + count += 1; + } + return count; +} + +static int xmit_get_pkts(struct net_device *dev, + struct Qdisc *q, + struct sk_buff_head *pktlist) +{ + struct sk_buff *skb; + int count = dev->xmit_win; + + if (count && dev->gso_skb) { + skb = dev->gso_skb; + dev->gso_skb = NULL; + count -= xmit_count_skbs(skb); + __skb_queue_tail(pktlist, skb); + } + + while (count > 0) { + skb = q->dequeue(q); + if (!skb) + break; + + count -= xmit_count_skbs(skb); + __skb_queue_tail(pktlist, skb); + } + + return skb_queue_len(pktlist); + } + +static int xmit_prepare_pkts(struct net_device *dev, + struct sk_buff_head *tlist) +{ + struct sk_buff *skb; + struct sk_buff_head *flist = &dev->blist; + + while ((skb = __skb_dequeue(tlist)) != NULL) + xmit_prepare_skb(skb, dev); + + return skb_queue_len(flist); +} /* * NOTE: Called under dev->queue_lock with locally disabled BH. @@ -130,22 +222,23 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb, * 0 - queue is not empty. 
* */ + static inline int qdisc_restart(struct net_device *dev) { struct Qdisc *q = dev->qdisc; - struct sk_buff *skb; - int ret; + int ret = 0; - /* Dequeue packet */ - if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL)) - return 0; + ret = xmit_get_pkts(dev, q, &dev->blist); + if (!ret) + return 0; - /* And release queue */ + /* We got em packets */ spin_unlock(&dev->queue_lock); + /* bye packets */ HARD_TX_LOCK(dev, smp_processor_id()); - ret = dev_hard_start_xmit(skb, dev); + ret = dev_batch_xmit(dev); HARD_TX_UNLOCK(dev); spin_lock(&dev->queue_lock); @@ -158,8 +251,8 @@ static inline int qdisc_restart(struct net_device *dev) break; case NETDEV_TX_LOCKED: - /* Driver try lock failed */ - ret = handle_dev_cpu_collision(skb, dev, q); + /* Driver lock failed */ + ret = xmit_islocked(&dev->blist, dev, q); break; default: @@ -168,7 +261,7 @@ static inline int qdisc_restart(struct net_device *dev) printk(KERN_WARNING "BUG %s code %d qlen %d\n", dev->name, ret, q->q.qlen); - ret = dev_requeue_skb(skb, dev, q); + ret = dev_requeue_skbs(&dev->blist, dev, q); break; } @@ -564,6 +657,9 @@ void dev_deactivate(struct net_device *dev) skb = dev->gso_skb; dev->gso_skb = NULL; + if (!skb_queue_empty(&dev->blist)) + skb_queue_purge(&dev->blist); + dev->xmit_win = 1; spin_unlock_bh(&dev->queue_lock); kfree_skb(skb);
[ofa-general] [PATCH 3/3][NET_BATCH] kill dev-gso_skb
This patch removes dev->gso_skb as it is no longer necessary with the batching code. cheers, jamal [NET_BATCH] kill dev->gso_skb The batching code does what gso used to batch at the drivers. There is no more need for gso_skb. If for whatever reason the requeueing is a bad idea we are going to leave packets in dev->blist (and still not need dev->gso_skb) Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED] --- commit 3a6202d62adff75b85d6ca0f8fd491abf9d63f4b tree 80497c538bdb3eab6ab81ff1dec1c3263da79826 parent 63381156b35719a364d0f81fec487e6263fb5467 author Jamal Hadi Salim [EMAIL PROTECTED] Mon, 08 Oct 2007 09:11:02 -0400 committer Jamal Hadi Salim [EMAIL PROTECTED] Mon, 08 Oct 2007 09:11:02 -0400 include/linux/netdevice.h | 3 --- net/sched/sch_generic.c | 12 2 files changed, 0 insertions(+), 15 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b31df5c..4ddc6eb 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -577,9 +577,6 @@ struct net_device struct list_head qdisc_list; unsigned long tx_queue_len; /* Max frames per queue allowed */ - /* Partially transmitted GSO packet. 
 */ - struct sk_buff *gso_skb; - /* ingress path synchronizer */ spinlock_t ingress_lock; struct Qdisc *qdisc_ingress; diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 96bfdcb..9112ea0 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -172,13 +172,6 @@ static int xmit_get_pkts(struct net_device *dev, struct sk_buff *skb; int count = dev->xmit_win; - if (count && dev->gso_skb) { - skb = dev->gso_skb; - dev->gso_skb = NULL; - count -= xmit_count_skbs(skb); - __skb_queue_tail(pktlist, skb); - } - while (count > 0) { skb = q->dequeue(q); if (!skb) @@ -647,7 +640,6 @@ void dev_activate(struct net_device *dev) void dev_deactivate(struct net_device *dev) { struct Qdisc *qdisc; - struct sk_buff *skb; spin_lock_bh(&dev->queue_lock); qdisc = dev->qdisc; @@ -655,15 +647,11 @@ void dev_deactivate(struct net_device *dev) qdisc_reset(qdisc); - skb = dev->gso_skb; - dev->gso_skb = NULL; if (!skb_queue_empty(&dev->blist)) skb_queue_purge(&dev->blist); dev->xmit_win = 1; spin_unlock_bh(&dev->queue_lock); - kfree_skb(skb); - dev_watchdog_down(dev); /* Wait for outstanding dev_queue_xmit calls. */
[ofa-general] [NET_BATCH] Some perf results
I've attached a small pdf with results. This adds on top of results I posted yesterday (although i didn't see them reflected on netdev). 1) batch-ntlst is the patches posted today that remove the temporary list in qdisc restart and is derived from this AM's net-2.6.24 2) batch-kern is the result of the batching patches posted yesterday that had the temporary list and is based on net-2.6.24 from yesterday AM 3) net-2.6.24 is yesterday AM's net-2.6.24 with no changes. So #1 is not a completely fair comparison with #2 and #3. However, looking at the logs, the changes that have gone in are unrelated to the areas i have touched, so i don't expect any effect. Overall, removing the temporary list from qdisc_restart provides a small improvement noticeable only at the smaller packet sizes. In any case, enjoy. cheers, jamal batch-res2.pdf Description: Adobe PDF document
[ofa-general] RE: [PATCH 2/3][NET_BATCH] net core use batching
On Mon, 2007-08-10 at 12:46 -0700, Waskiewicz Jr, Peter P wrote: I still have concerns how this will work with Tx multiqueue. The way the batching code looks right now, you will probably send a batch of skb's from multiple bands from PRIO or RR to the driver. For non-Tx multiqueue drivers, this is fine. For Tx multiqueue drivers, this isn't fine, since the Tx ring is selected by the value of skb->queue_mapping (set by the qdisc on {prio|rr}_classify()). If the whole batch comes in with different queue_mappings, this could prove to be an interesting issue. true, that needs some resolution. Here's a hand-waving thought: Assuming all packets of a specific map end up in the same qdisc queue, it seems feasible to ask the qdisc scheduler to give us enough packages (i've seen people use that term to refer to packets) for each hardware ring's available space. With the patches i posted, i do that via dev->xmit_win, which assumes only one view of the driver; essentially a single ring. If that is doable, then it is up to the driver to say "i have space for 5 in ring[0], 10 in ring[1], 0 in ring[2]" based on what scheduling scheme the driver implements - the dev->blist can stay the same. It's a handwave, so there may be issues there and there could be better ways to handle this. Note: The other issue that needs resolving that i raised earlier was in regards to multiqueue running on multiple cpus servicing different rings concurrently. Now I see in the driver HOWTO you recently sent that the driver will be expected to loop over the list and call its ->hard_start_xmit() for each skb. It's the core that does that, not the driver; the driver continues to use ->hard_start_xmit() (albeit a modified one). The idea is not to have many new interfaces. I think that should be fine for multiqueue, I just wanted to see if you had any thoughts on how it should work, any performance issues you can see (I can't think of any). 
Since the batching feature and Tx multiqueue are very new features, I'd like to make sure we can think of any possible issues with them coexisting before they are both mainline. Isn't multiqueue mainline already? Looking ahead for multiqueue, I'm still working on the per-queue lock implementation for multiqueue, which I know will not work with batching as it's designed today. The point behind batching is to reduce the cost of the locks by amortizing across the locks. Even better, if one can, they should get rid of locks. Remind me, why do you need the per-queuemap lock? And is it needed from the enqueuing side too? Maybe let's start there to help me understand things. I'm still not sure how to handle this, because it really would require the batch you send to have the same queue_mapping in each skb, so you're grabbing the correct queue_lock. Sure, that is doable if the driver can set a per-queue_mapping xmit_win and the qdisc can be taught to say "give me packets for queue_mapping X". Or, we could have the core grab all the queue locks for each skb->queue_mapping represented in the batch. That would block another batch, though, if it had any of those queues in its next batch before the first one completed. Thoughts? I am not understanding the desire to have locks on a per-queuemap. I think the single queue lock we have today should suffice. If the intent is to have concurrent cpus running to each hardware ring, then this is what i questioned earlier, whether it was the right thing to do (very top of email where i mention it as the other issue). cheers, jamal
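The hand-waving "5 in ring[0], 10 in ring[1], 0 in ring[2]" idea from this exchange can be written down as a tiny sketch. To be clear, this is purely hypothetical plain C, not an existing kernel interface: the toy_* names, NRINGS, and the per-ring xmit_win array are all invented to illustrate the proposal that the driver advertises space per ring and the qdisc is only asked for packets whose queue_mapping has room.

```c
#include <assert.h>

#define NRINGS 3 /* hypothetical number of hardware TX rings */

/* A driver advertising free descriptor space per ring instead of a
 * single device-wide dev->xmit_win. */
struct toy_mq_dev {
	int xmit_win[NRINGS];
};

/* How many of the n packets offered (all sharing this queue_mapping)
 * may the core batch to the driver right now? */
static int toy_batch_quota(const struct toy_mq_dev *dev,
			   int queue_mapping, int n)
{
	int win = dev->xmit_win[queue_mapping];

	return n < win ? n : win;
}
```

A ring with zero window simply contributes nothing to the batch, so a stalled low-prio ring never forces a requeue of packets destined for rings that still have space.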
[ofa-general] Re: parallel networking
On Mon, 2007-08-10 at 14:11 -0700, David Miller wrote: The problem is that the packet schedulers want global guarantees on packet ordering, not flow centric ones. That is the issue Jamal is concerned about. indeed, thank you for giving it better wording. The more I think about it, the more inevitable it seems that we really might need multiple qdiscs, one for each TX queue, to pull this full parallelization off. But the semantics of that don't smell so nice either. If the user attaches a new qdisc to ethN, does it go to all the TX queues, or what? All of the traffic shaping technology deals with the device as a unary object. It doesn't fit to multi-queue at all. If you let only one CPU at a time access the xmit path you solve all the reordering. If you want to be more fine grained you make the serialization point as low as possible in the stack - perhaps in the driver. But I think even what we have today with only one cpu entering the dequeue/scheduler region, _for starters_, is not bad actually ;-> What i am finding (and i can tell you i have been trying hard ;->) is that a sufficiently fast cpu doesn't sit in the dequeue area for too long (and batching reduces the time spent further). Very quickly there are no more packets for it to dequeue from the qdisc, or the driver is stopped and it has to get out of there. If you don't have any interrupt tied to a specific cpu then you can have many cpus enter and leave that region all the time. cheers, jamal
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
On Mon, 2007-08-10 at 14:26 -0700, David Miller wrote: Add xmit_win to struct net_device_subqueue, problem solved. If net_device_subqueue is visible from both the driver and the core scheduler area (couldn't tell from looking at what's in there already), then that'll do it. cheers, jamal
[ofa-general] RE: [PATCH 2/3][NET_BATCH] net core use batching
On Mon, 2007-08-10 at 15:33 -0700, Waskiewicz Jr, Peter P wrote: Addressing your note/issue with different rings being serviced concurrently: I'd like to remove the QDISC_RUNNING bit from the global. The challenge to deal with is that netdevices, filters, the queues and scheduler are closely inter-twined. So it is not just the scheduling region and QDISC_RUNNING. For example, let's pick just the filters because they are simple to see: You need to attach them to something - whatever that is, you then need to synchronize against config and multiple cpus trying to use them. You could: a) replicate them across cpus and only lock on config, but you are wasting RAM then b) attach them to rings instead of netdevices - but that makes me wonder if those subqueues are now going to become netdevices. This also means you change all user space interfaces to know about subqueues. If you recall, this was a major contention in our earlier discussion. device; with Tx multiqueue, this bit should be set on each queue (if at all), allowing multiple Tx rings to be loaded simultaneously. This is the issue i raised - refer to Dave's wording of it. If you run access to the rings simultaneously you may not be able to guarantee any ordering or proper qos in contention for wire-resources (think strict prio in hardware) - as long as you have the qdisc area. You may actually get away with it with something like DRR. You could totally bypass the qdisc region and go to the driver directly and let it worry about the scheduling, but you'd have to make the qdisc area a passthrough while providing the illusion to user space that all is as before. The biggest issue today with the multiqueue implementation is the global queue_lock. I see it being a hot source of contention in my testing; my setup is an 8-core machine (dual quad-core procs) with a 10GbE NIC, using 8 Tx and 8 Rx queues. On transmit, when loading all 8 queues, the enqueue/dequeue are hitting that lock quite a bit for the whole device. 
Yes, the queue lock is expensive; in your case, if all 8 hardware threads are contending for that one device, you will suffer. The tx lock on the other hand is not that expensive since the contention is for a max of 2 cpus (tx and rx softirq). I tried to use that fact in the batching to move things that i processed under the queue lock into the area for the tx lock. I'd be very interested in some results on such a piece of hardware with the 10G nic to see if these theories make any sense. I really think that the queue_lock should join the queue_state, so the device no longer manages the top-level state (since we're operating per-queue instead of per-device). Refer to above. The multiqueue implementation today enforces the number of qdisc bands (RR or PRIO) to be equal to the number of Tx rings your hardware/driver is supporting. Therefore, the queue_lock and queue_state in the kernel directly relate to the qdisc band management. If the queue stops from the driver, then the qdisc won't try to dequeue from the band. Good start. What I'm working on is to move the lock there too, so I can lock the queue when I enqueue (protect the band from multiple sources modifying the skb chain), and lock it when I dequeue. This is purely for concurrency of adding/popping skb's from the qdisc queues. Ok, so the concurrency aspect is what worries me. What i am saying is that sooner or later you have to serialize (which is anti-concurrency). For example, consider CPU0 running a high prio queue and CPU1 running the low prio queue of the same netdevice. Assume CPU0 is getting a lot of interrupts or other work while CPU1 doesn't (so as to create a condition where CPU1 is slower). Then as long as there are packets and there is space on the driver's rings, CPU1 will send more packets per unit time than CPU0. This contradicts the strict prio scheduler, which says higher priority packets ALWAYS go out first regardless of the presence of low prio packets. I am not sure i made sense. 
cheers, jamal
[ofa-general] Re: parallel networking
On Mon, 2007-08-10 at 15:33 -0700, David Miller wrote: Multiply whatever effect you think you might be able to measure due to that on your 2 or 4 way system, and multiple it up to 64 cpus or so for machines I am using. This is where machines are going, and is going to become the norm. Yes, i keep forgetting that ;- I need to train my brain to remember that. cheers, jamal
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
On Mon, 2007-08-10 at 18:41 -0700, David Miller wrote: I also want to point out another issue. Any argument wrt. reordering is specious at best because right now reordering from qdisc to device happens anyways. And that's because we drop the qdisc lock first, then we grab the transmit lock on the device and submit the packet. So, after we drop the qdisc lock, another cpu can get the qdisc lock, get the next packet (perhaps a lower priority one) and then sneak in to get the device transmit lock before the first thread can, and thus the packets will be submitted out of order. You forgot QDISC_RUNNING, Dave ;-> the above can't happen. Essentially at any one point in time, we are guaranteed that we can have multiple cpus enqueueing, but only one can be dequeueing (the one that managed to grab QDISC_RUNNING); i.e. multiple producers to the qdisc queue but only one consumer. Only the dequeuer has access to the tx lock. This, along with other things, makes me believe that ordering really doesn't matter in practice. And therefore, in practice, we can treat everything from the qdisc to the real hardware as a FIFO even if something else is going on inside the black box which might reorder packets on the wire. I think it is important to get the scheduling right - estimations can be a last resort. For example, if i have voip competing for the wire with ftp on two different rings/cpus and i specified that voip should be more important, i may consider the equipment faulty if it works most of the time (when ftp is not clogging the wire) and at times i am asked to repeat what i just said. cheers, jamal
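The multiple-producer/single-consumer guarantee described above boils down to a small state machine. The sketch below is a single-threaded toy model in plain C with invented toy_* names; the real kernel uses an atomic test_and_set_bit() on QDISC_RUNNING under dev->queue_lock, which this deliberately simplifies: whichever CPU wins the bit becomes the sole dequeuer (and sole user of the tx lock), so qdisc-to-driver ordering is preserved even with many enqueuers.

```c
#include <assert.h>

/* Models the QDISC_RUNNING bit: set while some CPU owns the
 * dequeue loop for this device. */
struct toy_qdisc_state {
	int running;
};

/* A CPU that just enqueued tries to become the dequeuer.
 * Returns 1 if it won the bit and must run the dequeue loop,
 * 0 if another CPU already holds it (just enqueue and leave). */
static int toy_try_dequeue(struct toy_qdisc_state *s)
{
	if (s->running)
		return 0;
	s->running = 1;
	return 1;
}

/* The winning CPU clears the bit when the qdisc is drained or the
 * driver is stopped, letting the next contender take over. */
static void toy_done_dequeue(struct toy_qdisc_state *s)
{
	s->running = 0;
}
```

While one "CPU" holds the bit, every other contender is turned away; only after the owner releases it can the next one win, which is exactly why the lower-priority packet in Dave's scenario cannot sneak past the dequeuer.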
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
On Tue, 2007-09-10 at 10:04 +0800, Herbert Xu wrote:

> Please revert
>
> commit 41843197b17bdfb1f97af0a87c06d24c1620ba90
> Author: Jamal Hadi Salim [EMAIL PROTECTED]
> Date: Tue Sep 25 19:27:13 2007 -0700
>
>     [NET_SCHED]: explict hold dev tx lock
>
> as this change introduces potential reordering and I don't think we've discussed this aspect sufficiently.

How does it introduce reordering?

cheers,
jamal
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
On Tue, 2007-09-10 at 10:16 +0800, Herbert Xu wrote:

> No it doesn't. I'd forgotten about the QDISC_RUNNING bit :)

You should know better, you wrote it, and i've been going insane trying to break it for at least a year now ;->

cheers,
jamal
[ofa-general] [PATCHES] TX batching
Please provide feedback on the code and/or architecture. Last time i posted them i received little. They are now updated to work with the latest net-2.6.24 from a few hours ago.

Patch 1: Introduces batching interface
Patch 2: Core uses batching interface
Patch 3: get rid of dev->gso_skb

What has changed since i posted last:
1) Fixed a bug eyeballed by Patrick McHardy on requeue reordering.
2) Killed ->hard_batch_xmit()
3) I am going one step back and making this set of patches even simpler so i can make them easier to review. I am therefore killing dev->hard_prep_xmit() and focusing just on batching. I plan to re-introduce dev->hard_prep_xmit(), but from now on i will make that a separate effort (it seems to be creating confusion in relation to the general work).

Dave, please let me know if this meets your desire to allow devices which are SG and able to compute CSUM to benefit, just in case i misunderstood. Herbert, if you can look at at least patch 3 i will appreciate it (since it kills dev->gso_skb that you introduced).

UPCOMING PATCHES
---
As before: more patches to follow later if i get some feedback - i didn't want to overload people by dumping too many patches. Most of the patches mentioned below are ready to go; some need some re-testing and others need a little porting from an earlier kernel:
- tg3 driver
- tun driver
- pktgen
- netiron driver
- e1000 driver (LLTX)
- e1000e driver (non-LLTX)
- ethtool interface
- There is at least one other driver promised to me

There's also a driver-howto i wrote that was posted on netdev last week, as well as one that describes the architectural decisions made.

PERFORMANCE TESTING
---
I started testing yesterday, but these tests take a long time, so i will post results probably at the end of the day sometime, and may stop running more tests and just compare batch vs non-batch results.
I have optimized the kernel config, so i expect my overall performance numbers to look better than the last test results i posted, for both batch and non-batch. My system under test hardware is still a 2xdual core opteron with a couple of tg3s.

A test tool generates udp traffic of different sizes for up to 60 seconds per run, or a total of 30M packets. I have 4 threads, each running on a specific CPU, which keep all the CPUs as busy as they can sending packets targeted at a directly connected box's udp discard port. All 4 CPUs target a single tg3 to send. The receiving box has a tc rule which counts and drops all incoming udp packets to the discard port - this allows me to make sure that the receiver is not the bottleneck in the testing.

Packet sizes sent are {8B, 32B, 64B, 128B, 256B, 512B, 1024B}. Each packet size run is repeated 10 times to ensure that there are no transients. The average of all 10 runs is then computed and collected. I do plan to also run forwarding and TCP tests in the future when the dust settles.

cheers,
jamal
[ofa-general] [PATCH 1/3] [NET_BATCH] Introduce batching interface
This patch introduces the netdevice interface for batching.

cheers,
jamal

[NET_BATCH] Introduce batching interface

This patch introduces the netdevice interface for batching.

BACKGROUND
---
A driver dev->hard_start_xmit() has 4 typical parts:
a) packet formatting (example: vlan, mss, descriptor counting, etc.)
b) chip specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit: tell DMA engine to chew on, tx completion interrupts, set last tx time, etc.

[For code cleanliness/readability sake, regardless of this work, one should break dev->hard_start_xmit() into those 4 functions anyways.]

INTRODUCING API
---
With the api introduced in this patch, a driver which has all 4 parts and needs to support batching is advised to split its dev->hard_start_xmit() in the following manner:
1) Remove #d from dev->hard_start_xmit() and put it in the dev->hard_end_xmit() method.
2) #b and #c can stay in ->hard_start_xmit() (or whichever way you want to do this)
3) #a is deferred to future work to reduce confusion (since it holds on its own).

Note: there are drivers which may not need to support either of the two approaches (example: the tun driver i patched), so the methods are optional.

The xmit_win variable is set by the driver to tell the core how much space it has to take on new skbs. It is introduced to ensure that when we pass the driver a list of packets it will swallow all of them - which is useful because we don't requeue to the qdisc (and avoids burning unnecessary cpu cycles or introducing any strange re-ordering). The driver tells us, when it invokes netif_wake_queue, how much space it has for descriptors by setting this variable. Refer to the driver howto for more details.

THEORY OF OPERATION
---
1. Core dequeues from qdiscs up to dev->xmit_win packets. Fragmented and GSO packets are accounted for as well.
2. Core grabs TX_LOCK
3. Core loop for all skbs: invokes driver dev->hard_start_xmit()
4. Core invokes driver dev->hard_end_xmit()

ACKNOWLEDGEMENT AND SOME HISTORY
---
There's a lot of history and reasoning of why batching in a document i am writing which i may submit as a patch. Thomas Graf (who doesn't know this probably) gave me the impetus to start looking at this back in 2004 when he invited me to the linux conference he was organizing. Parts of what i presented at SUCON in 2004 talk about batching. Herbert Xu forced me to take a second look around 2.6.18 - refer to my netconf 2006 presentation. Krishna Kumar provided me with more motivation in May 2007 when he posted on netdev and engaged me. Sridhar Samudrala, Krishna Kumar, Matt Carlson, Michael Chan, Jeremy Ethridge, Evgeniy Polyakov, Sivakumar Subramani, David Miller, Patrick McHardy, Jeff Garzik and Bill Fink have contributed in one or more of {bug fixes, enhancements, testing, lively discussion}. The Broadcom and neterion folks have been outstanding in their help.

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit 0a0762e2c615a980af284e86d9729d233e1bf7f4
tree c27fec824a9e75ffbb791647bdb595c082a54990
parent 190674ff1fe0b7bddf038c2bfddf45b9c6418e2a
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 07 Oct 2007 08:51:10 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 07 Oct 2007 08:51:10 -0400

 include/linux/netdevice.h |   11 ++
 net/core/dev.c            |   83 +
 2 files changed, 94 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 91cd3f3..b31df5c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -467,6 +467,7 @@ struct net_device
 #define NETIF_F_NETNS_LOCAL	8192	/* Does not change network namespaces */
 #define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 #define NETIF_F_LRO		32768	/* large receive offload */
+#define NETIF_F_BTX		65536	/* Capable of batch tx */

 /* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -595,6 +596,9 @@ struct net_device
 	void			*priv;	/* pointer to private data */
 	int			(*hard_start_xmit) (struct sk_buff *skb,
 						    struct net_device *dev);
+	void			(*hard_end_xmit) (struct net_device *dev);
+	int			xmit_win;
+
 	/* These may be needed for future network-power-down code. */
 	unsigned long		trans_start;	/* Time (in jiffies) of last Tx */
@@ -609,6 +613,7 @@ struct net_device
 	/* delayed register/unregister */
 	struct list_head	todo_list;
+	struct sk_buff_head	blist;
 	/* device index hash chain */
 	struct hlist_node	index_hlist;
@@ -1044,6 +1049,12 @@ extern int dev_set_mac_address(struct net_device
[ofa-general] [PATCH 2/3][NET_BATCH] net core use batching
This patch adds the usage of batching within the core.

cheers,
jamal

[NET_BATCH] net core use batching

This patch adds the usage of batching within the core. Performance results demonstrating improvement are provided separately.

I have #if-0ed some of the old functions so the patch is more readable. A future patch will remove all the #if-0ed content. Patrick McHardy eyeballed a bug that would cause re-ordering in case of a requeue.

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit cd602aa5f84fcef6359852cd99c95863eeb91015
tree f31d2dde4f138ff6789682163624bc0f8541aa77
parent 0a0762e2c615a980af284e86d9729d233e1bf7f4
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 07 Oct 2007 09:13:04 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 07 Oct 2007 09:13:04 -0400

 net/sched/sch_generic.c |  132 +++
 1 files changed, 120 insertions(+), 12 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 95ae119..80ac56b 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -56,6 +56,7 @@ static inline int qdisc_qlen(struct Qdisc *q)
 	return q->q.qlen;
 }

+#if 0
 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
 				  struct Qdisc *q)
 {
@@ -110,6 +111,97 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
 	return ret;
 }
+#endif
+
+static inline int handle_dev_cpu_collision(struct net_device *dev)
+{
+	if (unlikely(dev->xmit_lock_owner == smp_processor_id())) {
+		if (net_ratelimit())
+			printk(KERN_WARNING
+			       "Dead loop on netdevice %s, fix it urgently!\n",
+			       dev->name);
+		return 1;
+	}
+	__get_cpu_var(netdev_rx_stat).cpu_collision++;
+	return 0;
+}
+
+static inline int
+dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev,
+		 struct Qdisc *q)
+{
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue_tail(skbs)) != NULL)
+		q->ops->requeue(skb, q);
+
+	netif_schedule(dev);
+	return 0;
+}
+
+static inline int
+xmit_islocked(struct sk_buff_head *skbs, struct net_device *dev,
+	      struct Qdisc *q)
+{
+	int ret = handle_dev_cpu_collision(dev);
+
+	if (ret) {
+		if (!skb_queue_empty(skbs))
+			skb_queue_purge(skbs);
+		return qdisc_qlen(q);
+	}
+
+	return dev_requeue_skbs(skbs, dev, q);
+}
+
+static int xmit_count_skbs(struct sk_buff *skb)
+{
+	int count = 0;
+	for (; skb; skb = skb->next) {
+		count += skb_shinfo(skb)->nr_frags;
+		count += 1;
+	}
+	return count;
+}
+
+static int xmit_get_pkts(struct net_device *dev,
+			 struct Qdisc *q,
+			 struct sk_buff_head *pktlist)
+{
+	struct sk_buff *skb;
+	int count = dev->xmit_win;
+
+	if (count && dev->gso_skb) {
+		skb = dev->gso_skb;
+		dev->gso_skb = NULL;
+		count -= xmit_count_skbs(skb);
+		__skb_queue_tail(pktlist, skb);
+	}
+
+	while (count > 0) {
+		skb = q->dequeue(q);
+		if (!skb)
+			break;
+
+		count -= xmit_count_skbs(skb);
+		__skb_queue_tail(pktlist, skb);
+	}
+
+	return skb_queue_len(pktlist);
+}
+
+static int xmit_prepare_pkts(struct net_device *dev,
+			     struct sk_buff_head *tlist)
+{
+	struct sk_buff *skb;
+	struct sk_buff_head *flist = &dev->blist;
+
+	while ((skb = __skb_dequeue(tlist)) != NULL)
+		xmit_prepare_skb(skb, dev);
+
+	return skb_queue_len(flist);
+}

@@ -130,22 +222,32 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
 /*
  * NOTE: Called under dev->queue_lock with locally disabled BH.
  *	0  - queue is not empty.
  *
  */
-static inline int qdisc_restart(struct net_device *dev)
+static inline int qdisc_restart(struct net_device *dev,
+				struct sk_buff_head *tpktlist)
 {
 	struct Qdisc *q = dev->qdisc;
-	struct sk_buff *skb;
-	int ret;
+	int ret = 0;

-	/* Dequeue packet */
-	if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
-		return 0;
+	/* use of tpktlist reduces the amount of time we sit
+	 * holding the queue_lock
+	 */
+	ret = xmit_get_pkts(dev, q, tpktlist);
+	if (!ret)
+		return 0;

-	/* And release queue */
+	/* We got em packets */
 	spin_unlock(&dev->queue_lock);

+	/* prepare to embark; no locks held; moves packets
+	 * to dev->blist
+	 */
+	xmit_prepare_pkts(dev, tpktlist);
+
+	/* bye packets */
 	HARD_TX_LOCK(dev, smp_processor_id());
-	ret
[ofa-general] [PATCH 3/3][NET_BATCH] kill dev->gso_skb
This patch removes dev->gso_skb as it is no longer necessary with the batching code.

cheers,
jamal

[NET_BATCH] kill dev->gso_skb

The batching code does what gso_skb used to do at the drivers. There is no more need for gso_skb. If for whatever reason the requeueing is a bad idea, we are going to leave packets in dev->blist (and still not need dev->gso_skb).

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit 7ebf50f0f43edd4897b88601b4133612fc36af61
tree 5d942ecebc14de6254ab3c812d542d524e148e92
parent cd602aa5f84fcef6359852cd99c95863eeb91015
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 07 Oct 2007 09:30:19 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 07 Oct 2007 09:30:19 -0400

 include/linux/netdevice.h |    3 ---
 net/sched/sch_generic.c   |   12 ------------
 2 files changed, 0 insertions(+), 15 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b31df5c..4ddc6eb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -577,9 +577,6 @@ struct net_device
 	struct list_head	qdisc_list;
 	unsigned long		tx_queue_len;	/* Max frames per queue allowed */

-	/* Partially transmitted GSO packet. */
-	struct sk_buff		*gso_skb;
-
 	/* ingress path synchronizer */
 	spinlock_t		ingress_lock;
 	struct Qdisc		*qdisc_ingress;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 80ac56b..772e7fe 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -172,13 +172,6 @@ static int xmit_get_pkts(struct net_device *dev,
 	struct sk_buff *skb;
 	int count = dev->xmit_win;

-	if (count && dev->gso_skb) {
-		skb = dev->gso_skb;
-		dev->gso_skb = NULL;
-		count -= xmit_count_skbs(skb);
-		__skb_queue_tail(pktlist, skb);
-	}
-
 	while (count > 0) {
 		skb = q->dequeue(q);
 		if (!skb)
@@ -659,7 +652,6 @@ void dev_activate(struct net_device *dev)
 void dev_deactivate(struct net_device *dev)
 {
 	struct Qdisc *qdisc;
-	struct sk_buff *skb;

 	spin_lock_bh(&dev->queue_lock);
 	qdisc = dev->qdisc;
@@ -667,15 +659,11 @@ void dev_deactivate(struct net_device *dev)
 	qdisc_reset(qdisc);

-	skb = dev->gso_skb;
-	dev->gso_skb = NULL;
 	if (!skb_queue_empty(&dev->blist))
 		skb_queue_purge(&dev->blist);
 	dev->xmit_win = 1;
 	spin_unlock_bh(&dev->queue_lock);

-	kfree_skb(skb);
-
 	dev_watchdog_down(dev);

 	/* Wait for outstanding dev_queue_xmit calls. */
[ofa-general] NET_BATCH: some results
It seems prettier to just draw graphs, and since this one is a small file, here it is attached. The graph compares a patched net-2.6.24 vs a plain net-2.6.24 kernel with a udp app that sends on 4 CPUs as fast as the lower layers would allow it. Refer to my earlier description of the test setup etc.

As i noted earlier on, for this hardware at about 200B or so we approach wire speed, so the app is mostly idle above that as the link becomes the bottleneck; for example, it is 85% idle at 512B and 90% idle at 1024B. This is so for either batch or non-batch. So the differentiation is really in the smaller sized packets.

Enjoy!

cheers,
jamal

batch-pps.pdf
Description: Adobe PDF document
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
On Wed, 2007-03-10 at 01:29 -0400, Bill Fink wrote:

> It does sound sensible. My own decidedly non-expert speculation was that the big 30% performance hit right at 4 KB may be related to memory allocation issues or having to split the skb across multiple 4 KB pages.

plausible. But i also worry it could be 10 other things; for example, could it be the driver used? I noted in my udp test the oddity that turned out to be tx coal parameter related. In any case, i will attempt to run those tests later.

> And perhaps it only affected the single process case because with multiple processes lock contention may be a bigger issue and the xmit batching changes would presumably help with that. I am admittedly a novice when it comes to the detailed internals of TCP/skb processing, although I have been slowly slogging my way through parts of the TCP kernel code to try and get a better understanding, so I don't know if these thoughts have any merit.

You do bring up issues that need to be looked into and i will run those tests. Note, the effectiveness of batching becomes evident as the number of flows grows. Actually, scratch that: it becomes evident if you can keep the tx path busied out, to which multiple users running contribute. If i can have a user per CPU with lots of traffic to send, i can create that condition. It's a little boring in the scenario where the bottleneck is the wire, but it needs to be checked.

> BTW does anyone know of a good book they would recommend that has substantial coverage of the Linux kernel TCP code, that's fairly up-to-date and gives both an overall view of the code and packet flow as well as details on individual functions and algorithms, and hopefully covers basic issues like locking and synchronization, concurrency of different parts of the stack, and memory allocation.
> I have several books already on Linux kernel and networking internals, but they seem to only cover the IP (and perhaps UDP) portions of the network stack, and none have more than a cursory reference to TCP. The most useful documentation on the Linux TCP stack that I have found thus far is some of Dave Miller's excellent web pages and a few other web references, but overall it seems fairly skimpy for such an important part of the Linux network code.

Reading books or magazines may end up busying you out with some small gains of knowledge at the end; they tend to be outdated fast. My advice: if you start with a focus on one thing, watch the patches that fly around in that area and learn that way. Read the code to further understand things, then ask questions when it's not clear. Other folks may have different views. The other way to do it is to pick yourself some task to either add or improve something, and get your hands dirty that way.

> It would be good to see some empirical evidence that there aren't any unforeseen gotchas for larger packet sizes, that at least the same level of performance can be obtained with no greater CPU utilization.

Reasonable - i will try with 9K after i move over to the new tree from Dave and make sure nothing else broke in the previous tests. And when all looks good, i will move to TCP.

> > [1] On average i spend 10x more time performance testing and analysing results than writing code.
>
> As you have written previously, and I heartily agree with, this is a very good practice for developing performance enhancement patches.

To give you a perspective, the results i posted were each run 10 iterations per packet size per kernel. Each run is 60 seconds long. I think i am past that stage for resolving or fixing anything for UDP or pktgen, but i need to keep checking for any new regressions when Dave updates his tree. Now multiply that by 5 packet sizes (i am going to add 2 more) and multiply that by 3-4 kernels.
Then add the time it takes to sift through the data, collect it, analyze it, and go back to the drawing table when something doesn't look right. Essentially, it needs a weekend ;->

cheers,
jamal
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
On Tue, 2007-02-10 at 00:25 -0400, Bill Fink wrote:

> One reason I ask is that on an earlier set of alternative batching xmit patches by Krishna Kumar, his performance testing showed a 30% performance hit for TCP for a single process and a size of 4 KB, and a performance hit of 5% for a single process and a size of 16 KB (a size of 8 KB wasn't tested). Unfortunately I was too busy at the time to inquire further about it, but it would be a major potential concern for me in my 10-GigE network testing with 9000-byte jumbo frames. Of course the single process and 4 KB or larger size was the only case that showed a significant performance hit in Krishna Kumar's latest reported test results, so it might be acceptable to just have a switch to disable the batching feature for that specific usage scenario. So it would be useful to know if your xmit batching changes would have similar issues.

There were many times while testing that i noticed inconsistencies, and in each case when i analysed[1], i found it to be due to some variable other than batching which needed some resolving, always via some parametrization or other. I suspect what KK posted is in the same class. To give you an example: with UDP, batching was giving worse results at around 256B compared to 64B or 512B; investigating, i found that the receiver just wasn't able to keep up and the udp layer dropped a lot of packets, so both iperf and netperf reported bad numbers. Fixing the receiver brought the consistency back.

As to why 256B overwhelmed the receiver more than 64B (which sent more pps): on some limited investigation, it seemed to me to be the effect of the choice of the tg3 driver's default tx mitigation parameters as well as tx ring size, which is something i plan to revisit (but neutralizing it helps me focus on just batching). In the end i dropped both netperf and iperf for similar reasons and wrote my own app. What i am trying to achieve is to demonstrate whether batching is a GoodThing.
In experimentation like this, it is extremely valuable to reduce the variables. Batching may expose other orthogonal issues - those need to be resolved or fixed as they are found. I hope that sounds sensible.

Back to the >=9K packet size you raise above: i don't have a 10GigE card, so i am theorizing. Given that there's an observed benefit to batching for a saturated link with smaller packets (in my results, small is anything below 256B, which maps to about 380Kpps; anything above that seems to approach wire speed and the link is the bottleneck), i theorize that 10GigE with 9K jumbo frames, if already achieving wire rate, should continue to do so. And sizes below that will see improvements if they were not already hitting wire rate. So i would say that with 10G NICs there will be more observed improvements with batching with apps that do bulk transfers (assuming those apps are not seeing wire speed already). Note that this hasn't been quite the case even with TSO, given the bottlenecks in the Linux receivers that J Heffner put nicely in a response to some results you posted - but that exposes an issue with Linux receivers rather than TSO.

> Also for your xmit batching changes, I think it would be good to see performance comparisons for TCP and IP forwarding in addition to your UDP pktgen tests,

That is not pktgen - it is a udp app running in process context utilizing all 4 CPUs to send traffic. pktgen bypasses the stack entirely and has its own merits in proving that batching in fact is a GoodThing, even if it is just for traffic generation ;->

> including various packet sizes up to and including 9000-byte jumbo frames.

I will do TCP and forwarding tests in the near future.

cheers,
jamal

[1] On average i spend 10x more time performance testing and analysing results than writing code.
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
On Mon, 2007-01-10 at 12:42 +0200, Patrick McHardy wrote:

> jamal wrote:
> > +	while ((skb = __skb_dequeue(skbs)) != NULL)
> > +		q->ops->requeue(skb, q);
>
> ->requeue queues at the head, so this looks like it would reverse the order of the skbs.

Excellent catch! thanks; i will fix. As a side note: any batching driver should _never_ have to requeue; if it does, it is buggy. And the non-batching ones, if they ever requeue, will requeue a single packet, so not much reordering. Thanks again Patrick.

cheers,
jamal
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
On Mon, 2007-01-10 at 00:11 -0400, Bill Fink wrote:

> Have you done performance comparisons for the case of using 9000-byte jumbo frames?

I haven't, but will try if any of the gige cards i have support it. As a side note: i have not seen any useful gains or losses as the packet size approaches even 1500B MTU. For example, post about 256B, neither the batching nor the non-batching case gives much difference in either throughput or cpu use. Below 256B, there's a noticeable gain for batching.

Note, in the case of my tests all 4 CPUs are in full-throttle UDP, and so the occupancy of both the qdisc queue(s) and the ethernet ring is constantly high. For example, at 512B the app is 80% idle on all 4 CPUs and we are hitting in the range of wire speed. We are at 90% idle at 1024B. This is the case with or without batching. So my suspicion is that with that trend a 9000B packet will just follow the same pattern.

cheers,
jamal
[ofa-general] [PATCHES] TX batching
Latest net-2.6.24 breaks the patches i posted last week, so this is an update to resolve that. If you are receiving these emails and are finding them overloading, please give me a shout and i will remove your name.

Please provide feedback on the code and/or architecture. Last time i posted them i received none. They are now updated to work with the latest net-2.6.24 from a few hours ago.

Patch 1: Introduces batching interface
Patch 2: Core uses batching interface
Patch 3: get rid of dev->gso_skb

I have decided i will kill ->hard_batch_xmit() and not support LLTX drivers any more. This is the last of the patches that will have ->hard_batch_xmit(), as i am supporting an e1000 that is LLTX.

Dave, please let me know if this meets your desire to allow devices which are SG and able to compute CSUM to benefit, just in case i misunderstood. Herbert, if you can look at at least patch 3 i will appreciate it (since it kills dev->gso_skb that you introduced).

More patches to follow later if i get some feedback - i didn't want to overload people by dumping too many patches. Most of the patches mentioned below are ready to go; some need some re-testing and others need a little porting from an earlier kernel:
- tg3 driver (tested and works well, but don't want to send
- tun driver
- pktgen
- netiron driver
- e1000 driver (LLTX)
- e1000e driver (non-LLTX)
- ethtool interface
- There is at least one other driver promised to me

There's also a driver-howto i wrote that was posted on netdev last week, as well as one that describes the architectural decisions made.

Each of these patches has been performance tested (last with DaveM's tree from last weekend) and the results are in the logs on a per-patch basis. My system under test hardware is a 2xdual core opteron with a couple of tg3s. I have not re-run the tests with this morning's tree, but i suspect not much difference. My test tool generates udp traffic of different sizes for up to 60 seconds per run, or a total of 30M packets.
I have 4 threads, each running on a specific CPU, which keep all the CPUs as busy as they can sending packets targeted at a directly connected box's udp discard port. All 4 CPUs target a single tg3 to send. The receiving box has a tc rule which counts and drops all incoming udp packets to the discard port - this allows me to make sure that the receiver is not the bottleneck in the testing. Packet sizes sent are {64B, 128B, 256B, 512B, 1024B}. Each packet size run is repeated 10 times to ensure that there are no transients. The average of all 10 runs is then computed and collected.

cheers,
jamal
[ofa-general] [PATCH 1/4] [NET_BATCH] Introduce batching interface
This patch introduces the netdevice interface for batching.

cheers,
jamal

[NET_BATCH] Introduce batching interface

This patch introduces the netdevice interface for batching.

A typical driver dev->hard_start_xmit() has 4 parts:
a) packet formatting (example: vlan, mss, descriptor counting, etc.)
b) chip specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit: tell DMA engine to chew on, tx completion interrupts, etc.

[For code cleanliness/readability sake, regardless of this work, one should break dev->hard_start_xmit() into those 4 functions anyways.]

With the api introduced in this patch, a driver which has all 4 parts and needs to support batching is advised to split its dev->hard_start_xmit() in the following manner:
1) use its dev->hard_prep_xmit() method to achieve #a
2) use its dev->hard_end_xmit() method to achieve #d
3) #b and #c can stay in ->hard_start_xmit() (or whichever way you want to do this)

Note: there are drivers which may not need to support either of the two methods (example: the tun driver i patched), so the two methods are optional.

The core will first do the packet formatting by invoking your supplied dev->hard_prep_xmit() method. It will then pass you the packet via your dev->hard_start_xmit() method, and lastly will invoke your dev->hard_end_xmit() when it completes passing you all the packets queued for you. dev->hard_prep_xmit() is invoked without holding any tx lock, but the rest are under TX_LOCK().

LLTX presents a challenge in that we have to introduce a deviation from the norm and introduce the ->hard_batch_xmit() method. An LLTX driver presents us with ->hard_batch_xmit(), to which we pass a list of packets in a dev->blist skb queue. It is then the responsibility of ->hard_batch_xmit() to exercise steps #b and #c for all packets, and #d when the batching is complete. Step #a is already done for you by the time you get the packets in dev->blist.
And lastly, the xmit_win variable is introduced to ensure that when we pass the driver a list of packets it will swallow all of them - which is useful because we don't requeue to the qdisc (and avoids burning unnecessary cpu cycles or introducing any strange re-ordering). The driver tells us, when it invokes netif_wake_queue, how much space it has for descriptors by setting this variable.

Some decisions i had to make:
- every driver will have an xmit_win variable, and the core will set it to 1, which means the behavior of non-batching drivers stays the same.
- the batch list, blist, is no longer a pointer; this wastes a little extra memory which i plan to recoup by killing gso_skb in later patches.

There's a lot of history and reasoning of why batching in a document i am writing which i may submit as a patch. Thomas Graf (who doesn't know this probably) gave me the impetus to start looking at this back in 2004 when he invited me to the linux conference he was organizing. Parts of what i presented at SUCON in 2004 talk about batching. Herbert Xu forced me to take a second look around 2.6.18 - refer to my netconf 2006 presentation. Krishna Kumar provided me with more motivation in May 2007 when he posted on netdev and engaged me. Sridhar Samudrala, Krishna Kumar, Matt Carlson, Michael Chan, Jeremy Ethridge, Evgeniy Polyakov, Sivakumar Subramani, and David Miller have contributed in one or more of {bug fixes, enhancements, testing, lively discussion}. The Broadcom and netiron folks have been outstanding in their help.
Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]
---
commit 624a0bfeb971c9aa58496c7372df01f0ed750def
tree c1c0ee53453392866a5241631a7502ce6569b2cc
parent 260dbcc4b0195897c539c5ff79d95afdddeb3378
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:23:31 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:23:31 -0400

 include/linux/netdevice.h |   17 +++
 net/core/dev.c            |  106 +
 2 files changed, 123 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 91cd3f3..df1fb61 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -467,6 +467,7 @@ struct net_device
 #define NETIF_F_NETNS_LOCAL	8192	/* Does not change network namespaces */
 #define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 #define NETIF_F_LRO		32768	/* large receive offload */
+#define NETIF_F_BTX		65536	/* Capable of batch tx */

 /* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -595,6 +596,15 @@ struct net_device
 	void *priv;	/* pointer to private data */
 	int (*hard_start_xmit) (struct sk_buff *skb, struct net_device *dev);
+	/* hard_batch_xmit is needed for LLTX, kill it when those
+	 * disappear or better kill it now and dont support LLTX
+	 */
+	int
[ofa-general] [PATCH 2/3][NET_BATCH] net core use batching
This patch adds the usage of batching within the core. cheers, jamal

[NET_BATCH] net core use batching

This patch adds the usage of batching within the core. The same test methodology used in introducing txlock is used, with the following results on different kernels:

 pktsize  |   64B  |  128B  |  256B  |  512B  | 1024B  |
 ---------+--------+--------+--------+--------+--------+
 Original | 467482 | 463061 | 388267 | 216308 | 114704 |
 txlock   | 468922 | 464060 | 388298 | 216316 | 114709 |
 tg3nobtx | 468012 | 464079 | 388293 | 216314 | 114704 |
 tg3btxdr | 480794 | 475102 | 388298 | 216316 | 114705 |
 tg3btxco | 481059 | 475423 | 388285 | 216308 | 114706 |

The first two entries, Original and txlock, were introduced in an earlier patch and demonstrate a slight increase in performance with txlock. tg3nobtx shows the tg3 driver with no changes to support batching. The purpose of this test is to demonstrate the effect of introducing the core changes to a driver that doesn't support them. Although this patch brings down performance slightly compared to txlock for such netdevices, it is still better compared to just the original kernel. tg3btxdr demonstrates the effect of using ->hard_batch_xmit() with the tg3 driver. tg3btxco demonstrates the effect of letting the core do all the work. As can be seen, the last two are not very different in performance. The difference is that ->hard_batch_xmit() introduces a new method which is intrusive. I have #if-0ed some of the old functions so the patch is more readable.
Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]
---
commit 9b4a8fb190278d388c0a622fb5529d184ac8c7dc
tree 053e8dda02b5d26fe7cc778823306a8a526df513
parent 624a0bfeb971c9aa58496c7372df01f0ed750def
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:38:11 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:38:11 -0400

 net/sched/sch_generic.c |  127 +++
 1 files changed, 115 insertions(+), 12 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 95ae119..86a3f9d 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -56,6 +56,7 @@ static inline int qdisc_qlen(struct Qdisc *q)
 	return q->q.qlen;
 }

+#if 0
 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
 				  struct Qdisc *q)
 {
@@ -110,6 +111,97 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
 	return ret;
 }
+#endif
+
+static inline int handle_dev_cpu_collision(struct net_device *dev)
+{
+	if (unlikely(dev->xmit_lock_owner == smp_processor_id())) {
+		if (net_ratelimit())
+			printk(KERN_WARNING
+			       "Dead loop on netdevice %s, fix it urgently!\n",
+			       dev->name);
+		return 1;
+	}
+	__get_cpu_var(netdev_rx_stat).cpu_collision++;
+	return 0;
+}
+
+static inline int
+dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev,
+		 struct Qdisc *q)
+{
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue(skbs)) != NULL)
+		q->ops->requeue(skb, q);
+
+	netif_schedule(dev);
+	return 0;
+}
+
+static inline int
+xmit_islocked(struct sk_buff_head *skbs, struct net_device *dev,
+	      struct Qdisc *q)
+{
+	int ret = handle_dev_cpu_collision(dev);
+
+	if (ret) {
+		if (!skb_queue_empty(skbs))
+			skb_queue_purge(skbs);
+		return qdisc_qlen(q);
+	}
+
+	return dev_requeue_skbs(skbs, dev, q);
+}
+
+static int xmit_count_skbs(struct sk_buff *skb)
+{
+	int count = 0;
+	for (; skb; skb = skb->next) {
+		count += skb_shinfo(skb)->nr_frags;
+		count += 1;
+	}
+	return count;
+}
+
+static int xmit_get_pkts(struct net_device *dev,
+			 struct Qdisc *q,
+			 struct sk_buff_head *pktlist)
+{
+	struct sk_buff *skb;
+	int count = dev->xmit_win;
+
+	if (count && dev->gso_skb) {
+		skb = dev->gso_skb;
+		dev->gso_skb = NULL;
+		count -= xmit_count_skbs(skb);
+		__skb_queue_tail(pktlist, skb);
+	}
+
+	while (count > 0) {
+		skb = q->dequeue(q);
+		if (!skb)
+			break
[ofa-general] [PATCH 3/3][NET_SCHED] kill dev->gso_skb
This patch removes dev->gso_skb as it is no longer necessary with the batching code. cheers, jamal

[NET_SCHED] kill dev->gso_skb

The batching code does what gso used to batch at the drivers. There is no more need for gso_skb. If for whatever reason the requeueing is a bad idea, we are going to leave packets in dev->blist (and still not need dev->gso_skb).

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]
---
commit c2916c550d228472ddcdd676c2689fa6c8ecfcc0
tree 5beaf197fd08a038d83501f405017f48712d0318
parent 9b4a8fb190278d388c0a622fb5529d184ac8c7dc
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:38:58 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:38:58 -0400

 include/linux/netdevice.h |    3 ---
 net/sched/sch_generic.c   |   12 
 2 files changed, 0 insertions(+), 15 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index df1fb61..cea400a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -577,9 +577,6 @@ struct net_device
 	struct list_head qdisc_list;
 	unsigned long tx_queue_len;	/* Max frames per queue allowed */

-	/* Partially transmitted GSO packet.
 	 */
-	struct sk_buff *gso_skb;
-
 	/* ingress path synchronizer */
 	spinlock_t ingress_lock;
 	struct Qdisc *qdisc_ingress;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 86a3f9d..b4e1607 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -172,13 +172,6 @@ static int xmit_get_pkts(struct net_device *dev,
 	struct sk_buff *skb;
 	int count = dev->xmit_win;

-	if (count && dev->gso_skb) {
-		skb = dev->gso_skb;
-		dev->gso_skb = NULL;
-		count -= xmit_count_skbs(skb);
-		__skb_queue_tail(pktlist, skb);
-	}
-
 	while (count > 0) {
 		skb = q->dequeue(q);
 		if (!skb)
@@ -654,7 +647,6 @@ void dev_activate(struct net_device *dev)
 void dev_deactivate(struct net_device *dev)
 {
 	struct Qdisc *qdisc;
-	struct sk_buff *skb;

 	spin_lock_bh(&dev->queue_lock);
 	qdisc = dev->qdisc;
@@ -662,15 +654,11 @@ void dev_deactivate(struct net_device *dev)
 	qdisc_reset(qdisc);

-	skb = dev->gso_skb;
-	dev->gso_skb = NULL;
 	if (!skb_queue_empty(dev->blist))
 		skb_queue_purge(dev->blist);
 	dev->xmit_win = 1;

 	spin_unlock_bh(&dev->queue_lock);

-	kfree_skb(skb);
-
 	dev_watchdog_down(dev);

 	/* Wait for outstanding dev_queue_xmit calls. */
___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: [PATCH 1/3] [NET_BATCH] Introduce batching interface
Fixed subject - should be 1/3 not 1/4 On Sun, 2007-30-09 at 14:51 -0400, jamal wrote: This patch introduces the netdevice interface for batching. cheers, jamal
[ofa-general] Re: [PATCHES] TX batching
And here's a patch that provides a sample of the usage for batching with tg3. Requires the patch "[TG3] Some cleanups" i posted earlier. cheers, jamal

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 5a864bd..9aafb78 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -3103,6 +3103,13 @@ static inline u32 tg3_tx_avail(struct tg3 *tp)
 		((tp->tx_prod - tp->tx_cons) & (TG3_TX_RING_SIZE - 1)));
 }

+static inline void tg3_set_win(struct tg3 *tp)
+{
+	tp->dev->xmit_win = tg3_tx_avail(tp) - (MAX_SKB_FRAGS + 1);
+	if (tp->dev->xmit_win < 1)
+		tp->dev->xmit_win = 1;
+}
+
 /* Tigon3 never reports partial packet sends.  So we do not
  * need special logic to handle SKBs that have not had all
  * of their frags sent yet, like SunGEM does.
@@ -3165,8 +3172,10 @@ static void tg3_tx(struct tg3 *tp)
 	    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp))) {
 		netif_tx_lock(tp->dev);
 		if (netif_queue_stopped(tp->dev) &&
-		    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))
+		    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp))) {
+			tg3_set_win(tp);
 			netif_wake_queue(tp->dev);
+		}
 		netif_tx_unlock(tp->dev);
 	}
 }
@@ -4007,8 +4016,13 @@ void tg3_kick_DMA(struct net_device *dev)
 	if (unlikely(tg3_tx_avail(tp) <= (MAX_SKB_FRAGS + 1))) {
 		netif_stop_queue(dev);
-		if (tg3_tx_avail(tp) >= TG3_TX_WAKEUP_THRESH(tp))
+		dev->xmit_win = 1;
+		if (tg3_tx_avail(tp) >= TG3_TX_WAKEUP_THRESH(tp)) {
+			tg3_set_win(tp);
 			netif_wake_queue(dev);
+		}
+	} else {
+		tg3_set_win(tp);
 	}

 	mmiowb();
@@ -4085,6 +4099,7 @@ static int tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
 		if (!netif_queue_stopped(dev)) {
 			netif_stop_queue(dev);
+			tp->dev->xmit_win = 1;

 			/* This is a hard error, log it. */
 			printk(KERN_ERR PFX "%s: BUG!
 Tx Ring full when
@@ -4100,6 +4115,25 @@ static int tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	return ret;
 }

+static int tg3_start_bxmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tg3 *tp = netdev_priv(dev);
+
+	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
+		if (!netif_queue_stopped(dev)) {
+			netif_stop_queue(dev);
+			dev->xmit_win = 1;
+
+			/* This is a hard error, log it. */
+			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
+			       "queue awake!\n", dev->name);
+		}
+		return NETDEV_TX_BUSY;
+	}
+
+	return tg3_enqueue(skb, dev);
+}
+
 static int tg3_start_xmit_dma_bug(struct sk_buff *, struct net_device *);

 /* Use GSO to workaround a rare TSO bug that may be triggered when the
@@ -4112,9 +4146,11 @@ static int tg3_tso_bug(struct tg3 *tp, struct sk_buff *skb)
 	/* Estimate the number of fragments in the worst case */
 	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->gso_segs * 3))) {
 		netif_stop_queue(tp->dev);
+		tp->dev->xmit_win = 1;
 		if (tg3_tx_avail(tp) <= (skb_shinfo(skb)->gso_segs * 3))
 			return NETDEV_TX_BUSY;

+		tg3_set_win(tp);
 		netif_wake_queue(tp->dev);
 	}
@@ -4267,6 +4303,25 @@ static int tg3_enqueue_buggy(struct sk_buff *skb, struct net_device *dev)
 	return NETDEV_TX_OK;
 }

+static int tg3_start_bxmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tg3 *tp = netdev_priv(dev);
+
+	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
+		if (!netif_queue_stopped(dev)) {
+			netif_stop_queue(dev);
+			dev->xmit_win = 1;
+
+			/* This is a hard error, log it. */
+			printk(KERN_ERR PFX "%s: BUG!
+			       "Tx Ring full when "
+			       "queue awake!\n", dev->name);
+		}
+		return NETDEV_TX_BUSY;
+	}
+
+	return tg3_enqueue_buggy(skb, dev);
+}
+
 static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
 {
 	struct tg3 *tp = netdev_priv(dev);
@@ -4283,6 +4338,7 @@ static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
 	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
 		if (!netif_queue_stopped(dev)) {
 			netif_stop_queue(dev);
+			dev->xmit_win = 1;

 			/* This is a hard error, log it. */
 			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when
@@ -11099,15 +11155,19 @@ static int __devinit tg3_get_invariants(struct tg3 *tp)
 	else
 		tp->tg3_flags &= ~TG3_FLAG_POLL_SERDES;

+	tp->dev->hard_end_xmit = tg3_kick_DMA;
 	/* All chips before 5787 can get confused if TX buffers
 	 * straddle the 4GB address boundary in some cases.
 	 */
 	if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5755 ||
 	    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5787 ||
-	    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5906)
-		tp->dev->hard_start_xmit = tg3_start_xmit;
-	else
-		tp->dev->hard_start_xmit = tg3_start_xmit_dma_bug;
+	    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5906) {
+		tp->dev->hard_start_xmit = tg3_start_bxmit;
+		tp->dev->hard_prep_xmit = tg3_prep_frame;
+	} else {
+		tp->dev->hard_start_xmit = tg3_start_bxmit_dma_bug;
+		tp->dev->hard_prep_xmit = tg3_prep_bug_frame;
+	}

 	tp->rx_offset = 2;
 	if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5701
@@ -11955,6 +12015,8
[ofa-general] RE: [PATCH 1/4] [NET_SCHED] explicit hold dev tx lock
On Mon, 2007-24-09 at 16:47 -0700, Waskiewicz Jr, Peter P wrote: We should make sure we're symmetric with the locking on enqueue to dequeue. If we use the single device queue lock on enqueue, then dequeue will also need to check that lock in addition to the individual queue lock. The details of this are more trivial than the actual dequeue to make it efficient though. It would be interesting to observe the performance implications. The dequeue locking would be pushed into the qdisc itself. This is how I had it originally, and it did make the code more complex, but it was successful at breaking the heavily-contended queue_lock apart. I have a subqueue structure right now in netdev, which only has queue_state (for netif_{start|stop}_subqueue). This state is checked in sch_prio right now in the dequeue for both prio and rr. My approach is to add a queue_lock in that struct, so each queue allocated by the driver would have a lock per queue. Then in dequeue, that lock would be taken when the skb is about to be dequeued.

More locks imply degraded performance. If only one processor can enter that region, presumably after acquiring the outer lock, why this secondary lock per queue?

The skb->queue_mapping field also maps directly to the queue index itself, so it can be unlocked easily outside of the context of the dequeue function. The policy would be to use a spin_trylock() in dequeue, so that dequeue can still do work if enqueue or another dequeue is busy.

So there could be a parallel cpu dequeueing at the same time? Wouldn't this have implications depending on the scheduling algorithm used? If for example i was doing priority queueing, i would want to make sure the highest priority is being dequeued first AND by all means goes out first to the driver; i don't want a parallel cpu dequeueing a lower prio packet at the same time.

And the allocation of qdisc queues to device queues is assumed to be one-to-one (that's how the qdisc behaves now).
Ok, that brings back the discussion we had; my thinking was something like dev->hard_prep_xmit() would select the ring, and i think you statically already map the ring to a qdisc queue. So i don't think dev->hard_prep_xmit() is useful to you. In any case, there is nothing the batching patches do that interferes with or prevents you from going the path you intend to. Instead of dequeueing one packet, you dequeue several, and instead of sending the driver one packet, you send several. And using the xmit_win, you should never ever have to requeue. cheers, jamal
[ofa-general] Re: [PATCH 1/4] [NET_SCHED] explicit hold dev tx lock
On Mon, 2007-24-09 at 17:14 -0700, Stephen Hemminger wrote: Since we are redoing this, is there any way to make the whole TX path more lockless? The existing model seems to be more of a monitor than a real locking model.

What do you mean it is more of a monitor? On the challenge of making it lockless: about every NAPI driver combines the tx pruning with rx polling. If you are dealing with tx resources on the receive thread as well as the tx thread, _you need_ locking. The only other way we can avoid it is to separate the rx path interrupts from the ones on tx related resources; the last NAPI driver that did that was tulip; i think the e1000 for a short period in its life did the same as well. But that has been frowned on and people have evolved away from it. cheers, jamal
[ofa-general] Re: [PATCH 1/4] [NET_SCHED] explicit hold dev tx lock
On Tue, 2007-25-09 at 08:24 -0700, Stephen Hemminger wrote: The transmit code path is locked as a code region, rather than just object locking on the transmit queue or other fine grained object. This leads to moderately long lock hold times when multiple qdiscs and classification are being done.

It will be pretty tricky to optimize that path given the dependencies between the queues, classifiers, and actions in enqueues; schedulers in dequeues; as well as their config/queries from user space, which could happen concurrently on all N CPUs. The txlock optimization i added in patch 1 intends to let go of the queue lock sooner when we enter the dequeue region, to reduce the contention. A further optimization i made was to reduce the time it takes to hold the tx lock at the driver by moving gunk that doesn't need lock-holding into the new method dev->hard_end_xmit() (refer to patch #2).

If we went to finer grain locking it would also mean changes to all network devices using the new locking model. My assumption is that we would use something like the features flag to do the transition for backward compatibility. Take this as a purely "what if" or "it would be nice if" kind of suggestion, not a requirement or some grand plan.

Ok, hopefully someone will demonstrate how to achieve it; it seems a hard thing to achieve. cheers, jamal
[ofa-general] Re: [DOC] Net batching driver howto
On Tue, 2007-25-09 at 13:16 -0700, Randy Dunlap wrote: On Mon, 24 Sep 2007 18:54:19 -0400 jamal wrote: I have updated the driver howto to match the patches i posted yesterday. attached. Thanks for sending this.

Thank you for reading it, Randy. This is an early draft, right? It's a third revision - but you could call it early. When it is done, i will probably put a pointer to it in some patch. I'll fix some typos etc. in it (patch attached) and add some whitespace. Please see RD: in the patch for more questions/comments. Thanks, will do, and changes will show up in the next update. IMO it needs some changes to eliminate words like "we", "you", and "your" (words that personify code). Those words are OK when talking about yourself. The narrative intent is supposed to be i (or someone doing the description) sitting there with a pen and paper and maybe a laptop, walking through the details with someone who needs to understand those details. If you think it is important to make it formal, then by all means be my guest. Again, thanks for taking the time. cheers, jamal
[ofa-general] Re: [PATCH 1/4] [NET_SCHED] explicit hold dev tx lock
On Tue, 2007-25-09 at 18:15 -0400, jamal wrote: A further optimization i made was to reduce the time it takes to hold the tx lock at the driver by moving gunk that doesn't need lock-holding into the new method dev->hard_end_xmit() (refer to patch #2)

Sorry, that should have read dev->hard_prep_xmit(). cheers, jamal
[ofa-general] RE: [PATCH 1/4] [NET_SCHED] explicit hold dev tx lock
On Mon, 2007-24-09 at 12:12 -0700, Waskiewicz Jr, Peter P wrote: Hi Jamal, I've been (slowly) working on resurrecting the original design of my multiqueue patches to address this exact issue of the queue_lock being a hot item. I added a queue_lock to each queue in the subqueue struct, and in the enqueue and dequeue, just lock that queue instead of the global device queue_lock. The only two issues to overcome are the QDISC_RUNNING state flag, since that also serializes entry into the qdisc_restart() function, and the qdisc statistics maintenance, which needs to be serialized. Do you think this work along with your patch will benefit from one another?

The one thing that seems obvious is to use dev->hard_prep_xmit() in the patches i posted to select the xmit ring. You should be able to figure out the xmit ring without holding any lock. I lost track of how/where things went since the last discussion, so i need to wrap my mind around it to make sensible suggestions - i know the core patches are in the kernel but haven't paid attention to details; and if you look at my second patch you'd see a comment in dev_batch_xmit() which says i need to scrutinize multiqueue more. cheers, jamal
[ofa-general] [DOC] Net batching driver howto
I have updated the driver howto to match the patches i posted yesterday. attached. cheers, jamal

Here's the beginning of a howto for driver authors. The intended audience for this howto is people already familiar with netdevices.

1.0 Netdevice Pre-requisites
--
For hardware based netdevices, you must have at least hardware that is capable of doing DMA with many descriptors; i.e having hardware with a queue length of 3 (as in some fscked ethernet hardware) is not very useful in this case.

2.0 What is new in the driver API
---
There are 3 new methods and one new variable introduced. These are:
1) dev->hard_prep_xmit()
2) dev->hard_end_xmit()
3) dev->hard_batch_xmit()
4) dev->xmit_win

2.1 Using Core driver changes
-
To provide context, lets look at a typical driver abstraction for dev->hard_start_xmit(). It has 4 parts:
a) packet formatting (example vlan, mss, descriptor counting etc)
b) chip specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit, tell DMA engine to chew on, tx completion interrupts etc

[For code cleanliness/readability sake, regardless of this work, one should break the dev->hard_start_xmit() into those 4 functions anyways].

A driver which has all 4 parts and needs to support batching is advised to split its dev->hard_start_xmit() in the following manner:
1) use its dev->hard_prep_xmit() method to achieve #a
2) use its dev->hard_end_xmit() method to achieve #d
3) #b and #c can stay in ->hard_start_xmit() (or whichever way you want to do this)

Note: There are drivers which may not need to support either of the two methods (example the tun driver i patched), so the two methods are essentially optional.

2.1.1 Theory of operation
--
The core will first do the packet formatting by invoking your supplied dev->hard_prep_xmit() method. It will then pass you the packets via your dev->hard_start_xmit() method, for as many packets as you have advertised (via dev->xmit_win) you can consume.
Lastly it will invoke your dev->hard_end_xmit() when it completes passing you all the packets queued for you.

2.1.1.1 Locking rules
-
dev->hard_prep_xmit() is invoked without holding any tx lock but the rest are under TX_LOCK(). So you have to ensure that whatever you put in dev->hard_prep_xmit() doesn't require locking.

2.1.1.2 The slippery LLTX
-
LLTX drivers present a challenge in that we have to introduce a deviation from the norm and require the ->hard_batch_xmit() method. An LLTX driver presents us with ->hard_batch_xmit(), to which we pass a list of packets in a dev->blist skb queue. It is then the responsibility of ->hard_batch_xmit() to exercise steps #b and #c for all packets passed in the dev->blist. Steps #a and #d are done by the core should you register presence of dev->hard_prep_xmit() and dev->hard_end_xmit() in your setup.

2.1.1.3 xmit_win
-
dev->xmit_win is set by the driver to tell us how much space it has in its rings/queues. It is introduced to ensure that when we pass the driver a list of packets it will swallow all of them - which is useful because we don't requeue to the qdisc (and avoids burning unnecessary cpu cycles or introducing any strange re-ordering). The driver tells us, whenever it invokes netif_wake_queue, how much space it has for descriptors by setting this variable.

3.0 Driver Essentials
-
The typical driver tx state machine is:

-1-
+ Core sends packets
+ Driver puts packet onto hardware queue
+ if hardware queue is full, netif_stop_queue(dev)

-2-
+ core stops sending because of netif_stop_queue(dev)
.. time passes ...

-3-
+ driver has transmitted packets, opens up tx path by invoking netif_wake_queue(dev)

-1-
+ Cycle repeats and core sends more packets (step 1).

3.1 Driver pre-requisite
--
This is _a very important_ requirement in making batching useful. The pre-requisite for the batching changes is that the driver should provide a low threshold to open up the tx path.
Drivers such as tg3 and e1000 already do this. Before you invoke netif_wake_queue(dev) you check if there is a threshold of space reached to insert new packets. Here's an example of how i added it to the tun driver. Observe the setting of dev->xmit_win.

---
+#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
..
..
	u32 t = skb_queue_len(&tun->readq);
	if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
		tun->dev->xmit_win = tun->dev->tx_queue_len;
		netif_wake_queue(tun->dev);
	}
---

Here's how the batching e1000 driver does it:
--
	if (unlikely(cleaned && netif_carrier_ok(netdev) &&
		     E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD)) {
		if (netif_queue_stopped(netdev)) {
			int rspace = E1000_DESC_UNUSED(tx_ring
[ofa-general] RE: [PATCH 1/4] [NET_SCHED] explicit hold dev tx lock
On Mon, 2007-24-09 at 15:57 -0700, Waskiewicz Jr, Peter P wrote: I've looked at that as a candidate to use. The lock for enqueue would be needed when actually placing the skb into the appropriate software queue for the qdisc, so it'd be quick.

The enqueue is easy to comprehend. The single device queue lock should suffice. The dequeue is interesting: maybe you can point me to some doc, or describe to me the dequeue aspect; are you planning to have an array of tx locks, one per ring? What is the policy for how the qdisc queues are locked/mapped to tx rings? cheers, jamal
[ofa-general] [PATCHES] TX batching
I had plenty of time this weekend so i have been doing a _lot_ of testing. My next emails will send a set of patches:
Patch 1: Introduces explicit tx locking
Patch 2: Introduces batching interface
Patch 3: Core uses batching interface
Patch 4: get rid of dev->gso_skb

Testing
---
Each of these patches has been performance tested and the results are in the logs on a per-patch basis. My system under test hardware is a 2xdual core opteron with a couple of tg3s. My test tool generates udp traffic of different sizes for up to 60 seconds per run, or a total of 30M packets. I have 4 threads, each running on a specific CPU, which keep all the CPUs as busy as they can sending packets targeted at a directly connected box's udp discard port. All 4 CPUs target a single tg3 to send. The receiving box has a tc rule which counts and drops all incoming udp packets to the discard port - this allows me to make sure that the receiver is not the bottleneck in the testing. Packet sizes sent are {64B, 128B, 256B, 512B, 1024B}. Each packet size run is repeated 10 times to ensure that there are no transients. The average of all 10 runs is then computed and collected. I have not run testing on patch #4 because i had to let the machine go, but will have some access to it tomorrow early morning when i can run some tests.

Comments
--
I am trying to kill ->hard_batch_xmit() but it would be tricky to do without it for LLTX drivers. Anything i try will require a few extra checks. OTOH, i could kill LLTX for the drivers i am using that are LLTX and then drop that interface, or i could say "no support for LLTX". I am in a dilemma. Dave, please let me know if this meets your desires to allow devices which are SG and able to compute CSUM to benefit, just in case i misunderstood. Herbert, if you can look at at least patch 4 i will appreciate it. More patches to follow - i didn't want to overload people by dumping too many patches.
Most of these patches below are ready to go; some need more testing and others need a little porting from an earlier kernel:
- tg3 driver (tested and works well, but dont want to send
- tun driver
- pktgen
- netiron driver
- e1000 driver
- ethtool interface
- There is at least one other driver promised to me

I am also going to update the two documents i posted earlier. Hopefully i can do that today. cheers, jamal
[ofa-general] [PATCH 1/4] [NET_SCHED] explicit hold dev tx lock
I have submitted this before; but here it is again. Against net-2.6.24 from yesterday, for this and all following patches. cheers, jamal

[NET_SCHED] explicit hold dev tx lock

For N cpus, with full throttle traffic on all N CPUs, funneling traffic to the same ethernet device, the device's queue lock is contended by all N CPUs constantly. The TX lock is only contended by a max of 2 CPUs. In the current mode of qdisc operation, when all N CPUs contend for the dequeue region and one of them (after all the work) enters the dequeue region, we may end up aborting the path if we are unable to get the tx lock, and go back to contend for the queue lock. As N goes up, this gets worse. The changes in this patch result in a small increase in performance with a 4CPU (2xdual-core) setup with no irq binding. My tests are UDP based and keep all 4 CPUs busy all the time for the period of the test. Both e1000 and tg3 showed similar behavior. I expect higher gains with more CPUs. Summary below with different UDP packet sizes and the resulting pps seen. Note at around 200 bytes the two don't seem that much different, and we are approaching wire speed (with plenty of CPU available; e.g. at 512B the app is sitting at 80% idle in both cases).
 pktsize  |   64B  |  128B  |  256B  |  512B  | 1024B  |
 ---------+--------+--------+--------+--------+--------+
 Original | 467482 | 463061 | 388267 | 216308 | 114704 |
 txlock   | 468922 | 464060 | 388298 | 216316 | 114709 |

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]
---
commit b0e36991c5850dfe930f80ee508b08fdcabc18d1
tree b1787bba26f80a325298f89d1ec882cc5ab524ae
parent 42765047105fdd496976bc1784d22eec1cd9b9aa
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 23 Sep 2007 09:09:17 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 23 Sep 2007 09:09:17 -0400

 net/sched/sch_generic.c |   19 ++-
 1 files changed, 2 insertions(+), 17 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index e970e8e..95ae119 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -134,34 +134,19 @@ static inline int qdisc_restart(struct net_device *dev)
 {
 	struct Qdisc *q = dev->qdisc;
 	struct sk_buff *skb;
-	unsigned lockless;
 	int ret;

 	/* Dequeue packet */
 	if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
 		return 0;

-	/*
-	 * When the driver has LLTX set, it does its own locking in
-	 * start_xmit. These checks are worth it because even uncongested
-	 * locks can be quite expensive. The driver can do a trylock, as
-	 * is being done here; in case of lock contention it should return
-	 * NETDEV_TX_LOCKED and the packet will be requeued.
-	 */
-	lockless = (dev->features & NETIF_F_LLTX);
-
-	if (!lockless && !netif_tx_trylock(dev)) {
-		/* Another CPU grabbed the driver tx lock */
-		return handle_dev_cpu_collision(skb, dev, q);
-	}

 	/* And release queue */
 	spin_unlock(&dev->queue_lock);

+	HARD_TX_LOCK(dev, smp_processor_id());
 	ret = dev_hard_start_xmit(skb, dev);
-
-	if (!lockless)
-		netif_tx_unlock(dev);
+	HARD_TX_UNLOCK(dev);

 	spin_lock(&dev->queue_lock);
 	q = dev->qdisc;
[ofa-general] [PATCH 2/4] [NET_BATCH] Introduce batching interface
This patch introduces the netdevice interface for batching. cheers, jamal

[NET_BATCH] Introduce batching interface

This patch introduces the netdevice interface for batching. A typical driver dev->hard_start_xmit() has 4 parts:
a) packet formatting (example: vlan, mss, descriptor counting, etc.)
b) chip-specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit: telling the DMA engine to chew on it, tx completion interrupts, etc.

[For code cleanliness/readability's sake, regardless of this work, one should break dev->hard_start_xmit() into those 4 functions anyway.]

With the API introduced in this patch, a driver which has all 4 parts and needs to support batching is advised to split its dev->hard_start_xmit() in the following manner:
1) use its dev->hard_prep_xmit() method to achieve #a
2) use its dev->hard_end_xmit() method to achieve #d
3) #b and #c can stay in ->hard_start_xmit() (or whichever way you want to do this)

Note: there are drivers which may need to support neither of the two methods (example: the tun driver i patched), so the two methods are optional.

The core will first do the packet formatting by invoking your supplied dev->hard_prep_xmit() method. It will then pass you the packets via your dev->hard_start_xmit() method, and lastly will invoke your dev->hard_end_xmit() when it completes passing you all the packets queued for you. dev->hard_prep_xmit() is invoked without holding any tx lock; the rest are under TX_LOCK().

LLTX presents a challenge in that we have to deviate from the norm and introduce the ->hard_batch_xmit() method. An LLTX driver presents us with ->hard_batch_xmit(), to which we pass a list of packets in a dev->blist skb queue. It is then the responsibility of ->hard_batch_xmit() to exercise steps #b and #c for all packets, and #d when the batching is complete. Step #a is already done for you by the time you get the packets in dev->blist.
Lastly, the xmit_win variable is introduced to ensure that when we pass the driver a list of packets it will swallow all of them - which is useful because we don't requeue to the qdisc (and avoids burning unnecessary cpu cycles or introducing any strange re-ordering). The driver tells us how much space it has for descriptors by setting this variable when it invokes netif_wake_queue.

Some decisions i had to make:
- every driver will have an xmit_win variable, and the core will set it to 1, which means the behavior of non-batching drivers stays the same.
- the batch list, blist, is no longer a pointer; this wastes a little extra memory, which i plan to recoup by killing gso_skb in later patches.

There's a lot of history and reasoning on why batching in a document i am writing, which i may submit as a patch. Thomas Graf (who probably doesn't know this) gave me the impetus to start looking at this back in 2004 when he invited me to the linux conference he was organizing. Parts of what i presented at SUCON in 2004 talk about batching. Herbert Xu forced me to take a second look around 2.6.18 - refer to my netconf 2006 presentation. Krishna Kumar provided me with more motivation in May 2007 when he posted on netdev and engaged me. Sridhar Samudrala, Krishna Kumar, Matt Carlson, Michael Chan, Jeremy Ethridge, Evgeniy Polyakov, Sivakumar Subramani, and David Miller have contributed in one or more of {bug fixes, enhancements, testing, lively discussion}. The Broadcom and netiron folks have been outstanding in their help.
Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]
---
commit ab4b07ef2e4069c115c9c1707d86ae2344a5ded5
tree 994b42b03bbfcc09ac8b7670c53c12e0b2a71dc7
parent b0e36991c5850dfe930f80ee508b08fdcabc18d1
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 23 Sep 2007 10:30:32 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 23 Sep 2007 10:30:32 -0400

 include/linux/netdevice.h |   17 +++
 net/core/dev.c            |  106 +
 2 files changed, 123 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cf89ce6..443cded 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -453,6 +453,7 @@ struct net_device
 #define NETIF_F_NETNS_LOCAL	8192	/* Does not change network namespaces */
 #define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 #define NETIF_F_LRO		32768	/* large receive offload */
+#define NETIF_F_BTX		65536	/* Capable of batch tx */
 
 /* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -578,6 +579,15 @@ struct net_device
 	void			*priv;	/* pointer to private data */
 	int			(*hard_start_xmit) (struct sk_buff *skb,
 						    struct net_device *dev);
+	/* hard_batch_xmit is needed for LLTX, kill it when those
+	 * disappear or better kill it now and dont support LLTX
+	 */
+	int
[ofa-general] [PATCH 3/4][NET_BATCH] net core use batching
This patch adds the usage of batching within the core. cheers, jamal

[NET_BATCH] net core use batching

This patch adds the usage of batching within the core. The same test methodology used in introducing txlock is used, with the following results (pps) on different kernels:

+----------+--------+--------+--------+--------+--------+
|          | 64B    | 128B   | 256B   | 512B   | 1024B  |
+----------+--------+--------+--------+--------+--------+
| Original | 467482 | 463061 | 388267 | 216308 | 114704 |
| txlock   | 468922 | 464060 | 388298 | 216316 | 114709 |
| tg3nobtx | 468012 | 464079 | 388293 | 216314 | 114704 |
| tg3btxdr | 480794 | 475102 | 388298 | 216316 | 114705 |
| tg3btxco | 481059 | 475423 | 388285 | 216308 | 114706 |
+----------+--------+--------+--------+--------+--------+

The first two rows, Original and txlock, were introduced in an earlier patch and demonstrate a slight increase in performance with txlock. tg3nobtx shows the tg3 driver with no changes to support batching; the purpose of this test is to demonstrate the effect of introducing the core changes to a driver that doesn't support them. Although this patch brings performance down slightly compared to txlock for such netdevices, it is still better than the original kernel. tg3btxdr demonstrates the effect of using ->hard_batch_xmit() with the tg3 driver. tg3btxco demonstrates the effect of letting the core do all the work. As can be seen, the last two are not very different in performance; the difference is that ->hard_batch_xmit() introduces a new method, which is intrusive. I have #if-0ed some of the old functions so the patch is more readable.
Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]
---
commit e26705f6ef7db034df7af3f4fccd7cd40b8e46e0
tree b99c469497a0145ca5c0651dc4229ce17da5b31c
parent 6b8e2f76f86c35a6b2cee3698c633d20495ae0c0
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 23 Sep 2007 11:35:25 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 23 Sep 2007 11:35:25 -0400

 net/sched/sch_generic.c |  127 +++++++++++++++++++++++++++++++++++---
 1 files changed, 115 insertions(+), 12 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 95ae119..86a3f9d 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -56,6 +56,7 @@ static inline int qdisc_qlen(struct Qdisc *q)
 	return q->q.qlen;
 }
 
+#if 0
 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
 				  struct Qdisc *q)
 {
@@ -110,6 +111,97 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
 	return ret;
 }
+#endif
+
+static inline int handle_dev_cpu_collision(struct net_device *dev)
+{
+	if (unlikely(dev->xmit_lock_owner == smp_processor_id())) {
+		if (net_ratelimit())
+			printk(KERN_WARNING
+			       "Dead loop on netdevice %s, fix it urgently!\n",
+			       dev->name);
+		return 1;
+	}
+	__get_cpu_var(netdev_rx_stat).cpu_collision++;
+	return 0;
+}
+
+static inline int
+dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev,
+		 struct Qdisc *q)
+{
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue(skbs)) != NULL)
+		q->ops->requeue(skb, q);
+
+	netif_schedule(dev);
+	return 0;
+}
+
+static inline int
+xmit_islocked(struct sk_buff_head *skbs, struct net_device *dev,
+	      struct Qdisc *q)
+{
+	int ret = handle_dev_cpu_collision(dev);
+
+	if (ret) {
+		if (!skb_queue_empty(skbs))
+			skb_queue_purge(skbs);
+		return qdisc_qlen(q);
+	}
+
+	return dev_requeue_skbs(skbs, dev, q);
+}
+
+static int xmit_count_skbs(struct sk_buff *skb)
+{
+	int count = 0;
+	for (; skb; skb = skb->next) {
+		count += skb_shinfo(skb)->nr_frags;
+		count += 1;
+	}
+	return count;
+}
+
+static int xmit_get_pkts(struct net_device *dev,
+			 struct Qdisc *q,
+			 struct sk_buff_head *pktlist)
+{
+	struct sk_buff *skb;
+	int count = dev->xmit_win;
+
+	if (count && dev->gso_skb) {
+		skb = dev->gso_skb;
+		dev->gso_skb = NULL;
+		count -= xmit_count_skbs(skb);
+		__skb_queue_tail(pktlist, skb);
+	}
+
+	while (count > 0) {
+		skb = q->dequeue(q);
+		if (!skb)
+			break
[ofa-general] [PATCH 4/4][NET_SCHED] kill dev-gso_skb
This patch removes dev->gso_skb, as it is no longer necessary with the batching code. cheers, jamal

[NET_SCHED] kill dev->gso_skb

The batching code does what gso_skb used to batch at the drivers. There is no more need for gso_skb. If for whatever reason the requeueing is a bad idea, we are going to leave packets in dev->blist (and still not need dev->gso_skb).

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]
---
commit c6d2d61a73e1df5daaa294876f62454413fcb0af
tree 1d7bf650096a922a6b6a4e7d6810f83320eb94dd
parent e26705f6ef7db034df7af3f4fccd7cd40b8e46e0
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 23 Sep 2007 12:25:10 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 23 Sep 2007 12:25:10 -0400

 include/linux/netdevice.h |    3 ---
 net/sched/sch_generic.c   |   12 ------------
 2 files changed, 0 insertions(+), 15 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 443cded..7811729 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -560,9 +560,6 @@ struct net_device
 	struct list_head	qdisc_list;
 	unsigned long		tx_queue_len;	/* Max frames per queue allowed */
 
-	/* Partially transmitted GSO packet. */
-	struct sk_buff		*gso_skb;
-
 	/* ingress path synchronizer */
 	spinlock_t		ingress_lock;
 	struct Qdisc		*qdisc_ingress;

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 86a3f9d..b4e1607 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -172,13 +172,6 @@ static int xmit_get_pkts(struct net_device *dev,
 	struct sk_buff *skb;
 	int count = dev->xmit_win;
 
-	if (count && dev->gso_skb) {
-		skb = dev->gso_skb;
-		dev->gso_skb = NULL;
-		count -= xmit_count_skbs(skb);
-		__skb_queue_tail(pktlist, skb);
-	}
-
 	while (count > 0) {
 		skb = q->dequeue(q);
 		if (!skb)
@@ -654,7 +647,6 @@ void dev_activate(struct net_device *dev)
 void dev_deactivate(struct net_device *dev)
 {
 	struct Qdisc *qdisc;
-	struct sk_buff *skb;
 
 	spin_lock_bh(&dev->queue_lock);
 	qdisc = dev->qdisc;
@@ -662,15 +654,11 @@ void dev_deactivate(struct net_device *dev)
 	qdisc_reset(qdisc);
 
-	skb = dev->gso_skb;
-	dev->gso_skb = NULL;
 	if (!skb_queue_empty(&dev->blist))
 		skb_queue_purge(&dev->blist);
 	dev->xmit_win = 1;
 
 	spin_unlock_bh(&dev->queue_lock);
 
-	kfree_skb(skb);
-
 	dev_watchdog_down(dev);
 
 	/* Wait for outstanding dev_queue_xmit calls. */
[ofa-general] Re: [PATCHES] TX batching
On Sun, 2007-23-09 at 12:36 -0700, Kok, Auke wrote:
> please be reminded that we're going to strip down e1000 and most of the
> features should go into e1000e, which has much less hardware
> workarounds. I'm still reluctant to put new stuff in e1000 - I really
> want to chop it down first ;)

sure - the question then is: will you take those changes if i use e1000e? there are a few cleanups that have nothing to do with batching; take a look at the modified e1000 on the git tree.

cheers, jamal
[ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000
On Sun, 2007-16-09 at 20:13 -0700, David Miller wrote:
> What Herbert and I want to do is basically turn on TSO for devices that
> can't do it in hardware, and rely upon the GSO framework to do the
> segmenting in software right before we hit the device.

Sensible.

> This only makes sense for devices which can 1) scatter-gather and
> 2) checksum on transmit.

If you have knowledge there are enough descriptors in the driver to cover all the skbs you are passing, do you need to have #1? Note i don't touch fragments; i am assuming the driver is smart enough to handle them, otherwise it won't advertise it can handle scatter-gather.

> Otherwise we make too many copies and/or passes over the data.

I didn't understand this last bit - you are still going to go over the list regardless of whether you call ->hard_start_xmit() once or multiple times over the same list, no? In the latter case i am assuming a trimmed-down ->hard_start_xmit().

> UDP is too easy a test case in fact :-)

I learnt a lot about the behavior from doing udp (and before that with pktgen); there are a lot of driver habits that may need to be tuned before batching becomes really effective - which is easier to see with udp than with tcp.

cheers, jamal
[ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000
On Sun, 2007-16-09 at 18:02 -0700, David Miller wrote:
> I do. And I'm reviewing and applying several hundred patches a day.
> What's the point? :-)

Reading the commentary made me think you were about to swallow that with one more change by the time i woke up ;- I still think this work - despite my vested interest - needs more scrutiny from a performance perspective. I tend to send a url to my work, but it may be time to start posting patches.

cheers, jamal
[ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000
On Sun, 2007-16-09 at 19:25 -0700, David Miller wrote:
> There are tertiary issues I'm personally interested in, for example how
> well this stuff works when we enable software GSO on a non-TSO capable
> card. In such a case the GSO segment should be split right before we hit
> the driver and then all the sub-segments of the original GSO frame
> batched in one shot down to the device driver.

I think GSO is still useful on top of this. In my patches, anything with gso gets put into the batch list and shot down to the driver. I've never considered checking whether the nic is TSO capable; that may be something worth checking into. The netiron allows you to shove up to 128 skbs utilizing one tx descriptor, which makes for interesting possibilities.

> In this way you'll get a large chunk of the benefit of TSO without
> explicit hardware support for the feature. There are several cards (some
> even 10GB) that will benefit immensely from this.

indeed - i've always wondered if batching this way would make the NICs behave differently from the way TSO does.

On a side note: my observation is that with large packets on a very busy system (a bulk-transfer type app), one approaches wire speed; with or without batching, the apps are mostly idling (i've seen up to 90% idle time polling at the socket level for a write to complete on a really busy system). This is the case with or without batching; cpu seems a little better with batching. As the aggregation of the apps gets more aggressive (achievable by reducing their packet sizes), one can achieve improved throughput and reduced cpu utilization. This is all with UDP; i am still studying tcp.

cheers, jamal
[ofa-general] TSO, TCP Cong control etc
I've changed the subject to match content..

On Fri, 2007-14-09 at 03:20 -0400, Bill Fink wrote:
> On Mon, 27 Aug 2007, jamal wrote:
>> Bill: who suggested (as per your email) the 75usec value and what was
>> it based on measurement-wise?
>
> Belatedly getting back to this thread. There was a recent myri10ge patch
> that changed the default value for tx/rx interrupt coalescing to 75 usec
> claiming it was an optimum value for maximum throughput (and is also
> mentioned in their external README documentation).

I would think such a value would be very specific to the ring size, and maybe even to the machine in use.

> I also did some empirical testing to determine the effect of different
> values of TX/RX interrupt coalescing on 10-GigE network performance,
> both with TSO enabled and with TSO disabled. The actual test runs are
> attached at the end of this message, but the results are summarized in
> the following table (network performance in Mbps).
>
>               TX/RX interrupt coalescing in usec (both sides)
>                  0     15     30     45     60     75     90    105
> TSO enabled   8909   9682   9716   9725   9739   9745   9688   9648
> TSO disabled  9113   9910   9910   9910   9910   9910   9910   9910
>
> TSO disabled performance is always better than equivalent TSO enabled
> performance. With TSO enabled, the optimum performance is indeed at a
> TX/RX interrupt coalescing value of 75 usec. With TSO disabled,
> performance is the full 10-GigE line rate of 9910 Mbps for any value of
> TX/RX interrupt coalescing from 15 usec to 105 usec.

Interesting results. I think J Heffner made a very compelling description the other day, based on your netstat results at the receiver, of what is going on (refer to the comments on stretch ACKs). If the receiver is fixed, then you'd see better numbers from TSO. The 75 microsecs is very benchmarky in my opinion: if i were to pick a different app or a different NIC, or run on many cpus with many apps doing TSO, i highly doubt that would be the right number.

> Here's a retest (5 tests each):
>
> TSO enabled:
> TCP Cubic (initial_ssthresh set to 0): [..]
> TCP Bic (initial_ssthresh set to 0): [..]
> TCP Reno: [..]
>
> TSO disabled:
> TCP Cubic (initial_ssthresh set to 0): [..]
> TCP Bic (initial_ssthresh set to 0): [..]
> TCP Reno: [..]
>
> Not too much variation here, and not quite as high results as
> previously.

BIC seems to be on average better, followed by CUBIC, followed by Reno. The difference this time may be because you set the ssthresh to 0 (hopefully every run), and so Reno is definitely going to perform less well since it is a lot less aggressive in comparison to the other two.

> Some further testing reveals that while this time I mainly get results
> like (here for TCP Bic with TSO disabled):
>
> [EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
> 4958.0625 MB / 10.02 sec = 4148.9361 Mbps 100 %TX 99 %RX
>
> I also sometimes get results like:
>
> [EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
> 5882.1875 MB / 10.00 sec = 4932.5549 Mbps 100 %TX 90 %RX

not good.

> The higher performing results seem to correspond to when there's a
> somewhat lower receiver CPU utilization. I'm not sure but there could
> also have been an effect from running the -M1460 test after the 9000
> byte jumbo frame test (no jumbo tests were done at all prior to running
> the above sets of 5 tests, although I did always discard an initial
> warmup test, and now that I think about it some of those initial
> discarded warmup tests did have somewhat anomalously high results).

If you didn't reset the ssthresh on every run, could it have been cached and used on subsequent runs?

>> A side note: Although the experimentation reduces the variables (eg
>> tying all to CPU0), it would be more exciting to see multi-cpu and
>> multi-flow sender effect (which IMO is more real world).
>
> These systems are intended as test systems for 10-GigE networks, and as
> such it's important to get as consistently close to full 10-GigE line
> rate as possible, and that's why the interrupts and nuttcp application
> are tied to CPU0, with almost all other system applications tied to
> CPU1.

Sure, good benchmark. You get to know how well you can do.

> Now on another system that's intended as a 10-GigE firewall system, it
> has 2 Myricom 10-GigE NICs with the interrupts for eth2 tied to CPU0 and
> the interrupts for eth3 tied to CPU1. In IP forwarding tests of this
> system, I have basically achieved full bidirectional 10-GigE line rate
> IP forwarding with 9000 byte jumbo frames.

In forwarding, a more meaningful metric would be pps; the cost per packet tends to dominate the results over the cost per byte. 9K jumbo frames at 10G is less than 500Kpps - so i don't see that machine you are using sweating at all. To give you a comparison: on a lower-end opteron, on a single CPU, i can generate 1Mpps with batching pktgen; Robert says he can do that even without batching on an opteron closer to what you
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
On Sun, 2007-26-08 at 19:04 -0700, David Miller wrote:
> The transfer is much better behaved if we ACK every two full sized
> frames we copy into the receiver, and therefore don't stretch ACK, but
> at the cost of cpu utilization.

The rx coalescing in theory should help by accumulating more ACKs on the rx side of the sender. But it doesn't seem to do that; i.e. for the 9K MTU, you are better off turning off the coalescing if you want higher numbers. Also some of the TOE vendors (chelsio?) claim to have fixed this by reducing bursts on outgoing packets.

Bill: who suggested (as per your email) the 75usec value and what was it based on measurement-wise?

BTW, thanks for finding the energy to run those tests, and a very refreshing perspective. I don't mean to add more work, but i had some queries: on your earlier tests, i think Reno showed some significant differences on the lower MTU case over BIC. I wonder if this is consistent?

A side note: Although the experimentation reduces the variables (eg tying all to CPU0), it would be more exciting to see multi-cpu and multi-flow sender effect (which IMO is more real world).

Last note: you need a newer netstat.

> These effects are particularly pronounced on systems where the bus
> bandwidth is also one of the limiting factors.

Can you elucidate this a little more, Dave? Did you mean memory bandwidth?

cheers, jamal
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
On Thu, 2007-23-08 at 23:18 -0400, Bill Fink wrote:
[..]
> Here you can see there is a major difference in the TX CPU utilization
> (99 % with TSO disabled versus only 39 % with TSO enabled), although the
> TSO disabled case was able to squeeze out a little extra performance
> from its extra CPU utilization.

Good stuff. What kind of machine? SMP? Seems the receive side of the sender is also consuming a lot more cpu, i suspect because the receiver is generating a lot more ACKs with TSO. Does the choice of the tcp congestion control algorithm affect the results? It would be interesting to see both MTUs with either TCP BIC vs good old reno on the sender (probably without changing what the receiver does). BIC seems to be the default lately.

> Interestingly, with TSO enabled, the receiver actually consumed more CPU
> than with TSO disabled,

I would suspect the fact that a lot more packets make it into the receiver with TSO contributes.

> so I guess the receiver CPU saturation in that case (99 %) was what
> restricted its performance somewhat (this was consistent across a few
> test runs).

Unfortunately the receiver plays a big role in such tests - if it is bottlenecked then you are not really testing the limits of the transmitter.

cheers, jamal
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
On Wed, 2007-22-08 at 13:21 -0700, David Miller wrote:
> From: Rick Jones [EMAIL PROTECTED]
> Date: Wed, 22 Aug 2007 10:09:37 -0700
>
>> Should it be any more or less worrysome than small packet performance
>> (eg the TCP_RR stuff I posted recently) being rather worse with TSO
>> enabled than with it disabled?
>
> That, like any such thing shown by the batching changes, is a bug to
> fix.

Possibly a bug - but you really should turn off TSO if you are doing huge interactive transactions (which is fair because there is a clear demarcation). The litmus test is the same as for any change that is supposed to improve net performance - it has to demonstrate that it is not intrusive and that it improves performance consistently. The standard metrics are {throughput, cpu-utilization, latency}; i.e. as long as one improves and the others remain unchanged, it makes sense. Yes, i am religious about batching after all the invested sweat (and i continue to work on it hoping to demystify it) - the theory makes a lot of sense.

cheers, jamal
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
On Thu, 2007-23-08 at 18:04 -0400, jamal wrote:
> The litmus test is the same as for any change that is supposed to
> improve net performance - it has to demonstrate that it is not intrusive
> and that it improves performance consistently. The standard metrics are
> {throughput, cpu-utilization, latency}; i.e. as long as one improves and
> the others remain unchanged, it makes sense. Yes, i am religious about
> batching after all the invested sweat (and i continue to work on it
> hoping to demystify it) - the theory makes a lot of sense.

Before someone jumps and strangles me ;- By litmus test i meant as applied to batching. [TSO already passed - iirc, it has been demonstrated to really not add much to throughput (can't improve much over closeness to wire speed) but to improve CPU utilization.]

cheers, jamal
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
On Thu, 2007-23-08 at 15:30 -0700, David Miller wrote:
> From: jamal [EMAIL PROTECTED]
> Date: Thu, 23 Aug 2007 18:04:10 -0400
>
>> Possibly a bug - but you really should turn off TSO if you are doing
>> huge interactive transactions (which is fair because there is a clear
>> demarcation).
>
> I don't see how this can matter. TSO only ever does anything if you
> accumulate more than one MSS worth of data.

I stand corrected then.

cheers, jamal
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
On Thu, 2007-23-08 at 15:35 -0700, Rick Jones wrote:
> jamal wrote:
>> [TSO already passed - iirc, it has been demonstrated to really not add
>> much to throughput (can't improve much over closeness to wire speed)
>> but to improve CPU utilization].
>
> In the one gig space sure, but in the 10 Gig space, TSO on/off does make
> a difference for throughput.

I am still so 1GigE ;- I stand corrected again ;-

cheers, jamal
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:
> Using a 16K buffer size really isn't going to keep the pipe full enough
> for TSO.

Why the comparison with TSO (or GSO for that matter)? It seems to me that is only valid/fair if you have a single flow. Batching is multi-flow focused (or i should say flow-unaware).

cheers, jamal
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
On Tue, 2007-21-08 at 11:51 -0700, David Miller wrote:
> Because TSO does batching already, so it's a very good tit-for-tat
> comparison of the new batching scheme vs. an existing one.

Fair enough - i may have read too much into your email then ;- For bulk-type apps (where TSO will make a difference) this is a fair test, hence i agree the 16KB buffer size is not sensible if the goal is to simulate such an app. However (and this is where i read too much into what you were saying), the test by itself is an insufficient comparison. You gotta look at the other side of the coin, i.e. at apps where TSO won't buy much. Examples: a busy ssh or irc server; and you could go as far as looking at the most predominant app in the wild west, http (average page size from a few years back was in the range of 10-20K, and can be simulated with good ole netperf/iperf).

cheers, jamal
[ofa-general] Re: [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching
On Wed, 2007-08-08 at 16:05 +0200, Patrick McHardy wrote:
> This is not completely avoidable of course, but you can and should limit
> the damage by only pulling out as many packets as the driver can take
> and having the driver stop the queue afterwards.

This is why dev->xmit_win exists in my patches. The driver says how much space it has using this variable, and only that many packets get pulled off; so there is no conflict with any scheduling algorithm, be it work-conserving or not. The ideal case is never to carry over anything in dev->blist.

However, there is a challenge with GSO/TSO: if, say, the descriptors in the driver are fewer than the list of skbs, you are faced with the dilemma of sending a fraction of the gso list first. My current approach is to transmit as much as the driver will allow. On the next opportunity to transmit, you immediately send anything remaining first, before you ask the qdisc for more packets. An alternative approach i had entertained was not to send anything from the skb list when hitting such a case, but it hasn't seemed needed so far (i haven't seen any leftovers in my experiments).

cheers, jamal
[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
On Wed, 2007-08-08 at 21:42 +0800, Herbert Xu wrote:
> On Wed, Aug 08, 2007 at 03:49:00AM -0700, David Miller wrote:
>> Not because I think it obviates your work, but rather because I'm
>> curious, could you test a TSO-in-hardware driver converted to batching
>> and see how TSO alone compares to batching for a pure TCP workload?
>> You could even lower the bar by disabling TSO and enabling software
>> GSO.

From my observation, for TCP packets slightly above MTU (up to 2K), GSO gives worse performance than non-GSO throughput-wise. Actually this has nothing to do with batching; rather, the behavior is consistent with or without the batching changes.

> I personally don't think it will help for that case at all, as TSO
> likely does a better job of coalescing the work _and_ reducing bus
> traffic as well as work in the TCP stack.

I agree.

> I suspect the bulk of the effort is in getting these skb's created and
> processed by the stack so that by the time they're exiting the qdisc
> there's not much to be saved anymore.

pktgen shows a clear win if you test the driver path - which is what you should test, because that's where the batching changes are. Using TCP or UDP adds other variables[1] that need to be isolated first in order to quantify the effect of batching. For throughput and CPU utilization, the benefit will be clear when there are a lot more flows.

cheers, jamal

[1] I think there are unfortunately too many other variables in play when you are dealing with a path that starts above the driver and covers end-to-end effects: traffic/app source, system clock sources (as per my recent discovery), congestion control algorithms used, tuning of the receiver, etc.
[ofa-general] Re: [PATCH 00/10] Implement batching skb API
KK, On Tue, 2007-24-07 at 09:14 +0530, Krishna Kumar2 wrote: J Hadi Salim [EMAIL PROTECTED] wrote on 07/23/2007 06:02:01 PM: Actually you have not sent netperf results with prep and without prep. My results were based on pktgen (which i explained as testing the driver). I think depending on netperf without further analysis is simplistic. It was like me doing forwarding tests on these patches. So _which_ non-LLTX driver doesn't do that? ;-) I have no idea since I haven't looked at all drivers. Can you tell me which non-LLTX drivers do that? I stated this as the sole criterion. The few i have peeked at all do it. I also think the e1000 should be converted to be non-LLTX. The rest of netdev is screaming to kill LLTX. tun driver doesn't use it either - but i doubt that makes it bloat. Adding extra code that is currently not usable (esp from a submission point) is bloat. So far i have converted 3 drivers, 1 of which doesn't use it. Two more driver conversions are on the way; they will both use it. How is this bloat again? A few emails back you said if only IPoIB can use batching then that's good enough justification. You waltz in, have the luxury of looking at my code, presentations, many discussions with me etc ... luxury? I had implemented the entire thing even before knowing that you were working on something similar! and I had sent the first proposal to netdev. I saw your patch at the end of May (or at least 2 weeks after you said it existed). That patch has very little resemblance to what you just posted, concept-wise or code-wise. I could post it if you would give me permission. *after* which you told me that you have your own code and presentations (which I had never seen earlier - I joined netdev a few months back; earlier I was working on RDMA and Infiniband, as you know). I am gonna assume you didn't know of my work - which i have been making public for about 3 years. 
In fact i talked about this topic when i visited your office in 2006 on a day you were not present, so it is plausible you didn't hear of it. And it didn't give me any great ideas either; remember I had posted results for E1000 at the time of sending the proposals. In mid-June you sent me a series of patches which included anything from changing variable names to combining qdisc_restart - about everything i referred to as being cosmetic differences in your posted patches. I took two of those and incorporated them. One was an XXX in my code already, to allocate the dev->blist (Commit: bb4464c5f67e2a69ffb233fcf07aede8657e4f63). The other was a mechanical removal of the blist being passed (Commit: 0e9959e5ee6f6d46747c97ca8edc91b3eefa0757). Some of the others i asked you to defer. For example, the reason i gave you for not merging any qdisc_restart_combine changes is that i was waiting for Dave to swallow the qdisc_restart changes i made; otherwise maintenance becomes extremely painful for me. Sridhar actually provided a lot more valuable comments and fixes but has not planted a flag on behalf of the queen of Spain like you did. However I do give credit in my proposal to you for the ideas you provided (without actual code), and the same I did for other people who did the same, like Dave and Sridhar. BTW, you too had discussions with me, and I sent some patches to improve your code too. I incorporated two of your patches and asked for deferral of others. These patches have now shown up in what you claim as the difference. I just call them cosmetic differences, not to downplay the importance of having an ethtool interface, but because they do not make batching perform any better. The real differences are those two items. I am surprised you haven't cannibalized those changes as well. 
I thought you renamed them to something else; according to your posting: This patch will work with drivers updated by Jamal, Matt & Michael Chan with minor modifications - rename xmit_win to xmit_slots & rename the batch handler. Or maybe that's a future plan you have in mind? so it looks like a two way street to me (and that is how open source works and should). Open source is a lot more transparent than that. You posted a question, which was part of your research. I responded and told you i have patches; you asked me for them and i promptly ported them from pre-2.6.18 to the latest kernel at the time. The nature of this batching work is one of performance, so numbers are important. If you had some strong disagreements on something in the architecture, then it would be of great value to explain it in technical detail - and more importantly to provide some numbers to say why it is a bad idea. You get numbers by running some tests. You did none of the above. Your effort has been to produce your patch for whatever reasons. This would not have been problematic to me if it was actually based within reasons of optimization, because the end goal would have been achieved. I have deleted the rest of the email
[ofa-general] Re: [PATCH 00/10] Implement batching skb API
KK, On Mon, 2007-23-07 at 10:19 +0530, Krishna Kumar2 wrote: Hmmm? Evgeniy has not even tested my code to find some regression :) And you may possibly not find much improvement in E1000 when you run iperf (which is what I do) compared to pktgen. Pktgen is the correct test (or the closest to correct) because it tests the driver tx path. iperf/netperf test the effect of batching on tcp/udp. In fact i would start with udp first. What you need to do if testing end-2-end is see where the effects occur. For example, it is feasible that batching is a little too aggressive and the receiver can't keep up (netstat -s before and after will be helpful). Maybe by such insight we can improve things. My experiments show it is useful (in a very visible way using pktgen) for e1000 to have the prep() interface. I meant: have you compared results of batching with prep on vs prep off, and what is the difference in BW? Yes, and these results were sent to you as well a while back. When i get the time when i am back i will look them up on my test machine and resend. No. I see value only in non-LLTX drivers which also get the same TX lock in the RX path. So _which_ non-LLTX driver doesn't do that? ;-) The value is also there in LLTX drivers, even if it is just in formatting an skb ready for transmit. If this is not clear i could do a much longer writeup on my thought evolution towards adding prep(). In LLTX drivers, the driver does the 'prep' without holding the tx_lock in any case, so there should be no improvement. Could you send the write-up? I will - please give me some time; i am overloaded at the moment. There is *nothing* IPoIB specific or focused in my code. 
I said adding prep doesn't work for IPoIB and so it is pointless to add bloat to the code until some code can use it. tun driver doesn't use it either - but i doubt that makes it bloat. What I meant to say is that there isn't much point in saying that your code is not ready, or you are using an old code base, or it has multiple restart functions, or is not tested enough, etc, and then saying let's re-do/rethink the whole implementation when my code is already working and giving good results. The suggestive hand gesturing is the kind of thing that bothers me. What do you think: Would i be submitting patches based on 2.6.22-rc4? Would it make sense to include parallel qdisc paths? For heaven's sake, i have told you i would be fine with accepting such changes once the qdisc restart changes went in first. You waltz in, have the luxury of looking at my code, presentations, many discussions with me etc ... When i ask for differences to the code you produced, they now seem to sum up to the two below. You don't think there's some honest issue with this picture? OTOH, if you find some cases that are better handled with: 1. prep handler 2. xmit_win (which I don't have now), then please send me patches and I will also test them out and incorporate. And then of course you will end up adding those because they are both useful, just calling them some other name. And then you will end up incorporating all the drivers i invested many hours (as a gratuitous volunteer) to change and test - maybe you will change variable names or rearrange some function. I am a very compromising person; i have no problem coauthoring these patches if you actually invest useful time, like fixing things up and doing proper tests. But you are not doing that - instead you are being extremely aggressive and hijacking the whole thing. It is courteous, if you find somebody else has a patch, to point out what's wrong, preferably with some proof. It sounds disingenuous but i may have misread you. 
(lacking in frankness, candor, or sincerity; falsely or hypocritically ingenuous; insincere) Sorry, no response to personal comments - I am not going to have a flame-war :) Give me a better description. cheers, jamal
[ofa-general] Re: [PATCH 00/10] Implement batching skb API
KK, On Sun, 2007-22-07 at 11:57 +0530, Krishna Kumar2 wrote: Batching need not be useful for every hardware. My concern is there is no consistency in results. I see improvements on something which you say you don't. You see improvement in something that Evgeniy doesn't, etc. There are many knobs and we need at a minimum to find out how those play. I don't quite agree with that approach; eg, if the blist is empty and the driver says there is space for one packet, you will add one packet, the driver sends it out, and the device is stopped (with potentially a lot of skbs on dev->q). Then no packets are added till the queue is enabled, at which time a flood of skbs will be processed, increasing latency and holding the lock for a single longer duration. My approach will mitigate holding the lock for longer times and instead send skbs to the device as long as we are within the limits. Just as a side note, _I do have this feature_ in the pktgen piece. In fact, you can tell pktgen what that bound is, as opposed to hard coding it (look at the pktgen batchl parameter). I have not found it to be useful experimentally; actually, i should say i could not zero in on a useful value by experimenting, and it was better to turn it off. I never tried adding it to the qdisc path - but this is something i could try, and as i said it may prove useful. Since E1000 doesn't seem to use the TX lock on RX (at least I couldn't find it), I feel having prep will not help as no other cpu can execute the queue/xmit code anyway (E1000 is also an LLTX driver). My experiments show it is useful (in a very visible way using pktgen) for e1000 to have the prep() interface. Other drivers that hold the tx lock could get improvement however. So you do see the value then with non-LLTX drivers, right? ;-) The value is also there in LLTX drivers, even if just in formatting an skb ready for transmit. If this is not clear i could do a much longer writeup on my thought evolution towards adding prep(). 
I wonder if you tried enabling/disabling 'prep' on E1000 to see how the performance is affected. Absolutely. And regardless of whether it is beneficial or not for e1000, there's clear benefit in the tg3 for example. If it helps, I guess you could send me a patch to add that and I can also test it to see what the effect is. I didn't add it since IPoIB wouldn't be able to exploit it (unless someone is kind enough to show me how to). Such core code should not be focused just on IPoIB. I think the code I have is ready and stable. I am not sure how to interpret that - are you saying all-is-good and we should just push your code in? It sounds disingenuous but i may have misread you. cheers, jamal
[ofa-general] Re: [PATCH 00/10] Implement batching skb API
I am (have been) under extreme travel mode - so i will have high latency in follow-ups. On Fri, 2007-20-07 at 12:01 +0530, Krishna Kumar wrote: Hi Dave, Roland, everyone, In May, I had proposed creating an API for sending 'n' skbs to a driver to reduce lock overhead, DMA operations, and, specific to drivers that have completion notification like IPoIB, reduce completion handling ([RFC] New driver API to speed up small packets xmits @ http://marc.info/?l=linux-netdev&m=117880900818960&w=2). I had also sent initial test results for E1000 which showed minor improvements (but also got degradations) @ http://marc.info/?l=linux-netdev&m=117887698405795&w=2. Add to that context: that i have been putting out patches on this over the last 3+ years, as well as several public presentations - the last one being: http://vger.kernel.org/jamal_netconf2006.sxi My main problem (and obstacle to submitting the patches) has been a result of not doing the appropriate testing - i had been testing the forwarding path (in all my results post the latest patches) when i should really have been testing the improvement of the tx path. There is a parallel WIP by Jamal, but the two implementations are completely different since the code bases were separate from the start. Key changes: - Use a single qdisc interface to avoid code duplication and reduce the maintenance burden (sch_generic.c size reduces by ~9%). - Has a per-device configurable parameter to turn on/off batching. - qdisc_restart gets only slightly modified and stays simple, without any checks for batching vs regular code (in fact only two lines have changed: 1. instead of dev_dequeue_skb, a new batch-aware function is called; and 2. an extra call to hard_start_xmit_batch). - No change in __qdisc_run other than a new argument (from DM's idea). - Applies to the latest net-2.6.23 compared to 2.6.22-rc4 code. All the above are cosmetic differences. To me the highest priority is making sure that batching is useful and what the limitations are. 
At some point, when all looks good, i don't mind adding an ethtool interface to turn batching on/off, merging with the new qdisc restart path instead of having a parallel path, soliciting feedback on naming, where to allocate structs, etc etc. All that is low prio if batching across a variety of hardware and applications doesn't prove useful. At the moment, i am unsure there's enough consistency to justify pushing batching in. Having said that, below are the main architectural differences we have, which is what we really need to discuss and see what proves useful: - Batching algo/processing is different (e.g. if qdisc_restart() finds one skb in the batch list, it will try to batch more (up to a limit) instead of sending that out and batching the rest in the next call). This sounds a little more aggressive but maybe useful. I have experimented with setting upper bound limits (current patches have a pktgen interface to set the max to send) and have concluded that it is unneeded. Probing by letting the driver tell you what space is available has proven to be the best approach. I have been meaning to remove the code in pktgen which allows these limits. - Jamal's code has a separate hw prep handler called from the stack, and results are accessed in the driver during xmit later. I have explained the reasoning for this a few times. A recent response to Michael Chan is here: http://marc.info/?l=linux-netdev&m=118346921316657&w=2 And here's a response to you that i haven't heard back on: http://marc.info/?l=linux-netdev&m=118355539503924&w=2 My tests so far indicate this interface is useful. It doesn't apply well to some drivers (for example i don't use it in tun) - which makes it optional, but useful nevertheless. I will be more than happy to kill this if i can find cases where it proves to be a bad idea. - Jamal's code has dev->xmit_win which is cached by the driver. Mine has dev->xmit_slots, but this is used only by the driver while the core has a different mechanism to find how many skbs to batch. 
This is related to the first item. - Completely different structure/design coding styles. (This patch will work with drivers updated by Jamal, Matt & Michael Chan with minor modifications - rename xmit_win to xmit_slots & rename the batch handler.) Again, cosmetics (and an indication you are morphing towards me). So if i was to sum this up (it would be a useful discussion to have on these), the real differences are: a) you have an extra check on refilling the skb list when you find that it has a single skb. I tagged this as being potentially useful. b) You have a check for some upper bound on the number of skbs to send to the driver. I tagged this as unnecessary - the interface is still on in my current code, so it shouldn't be hard to show one way or the other. c) You don't have prep_xmit() Add to that list any other architectural differences
[ofa-general] TCP and batching WAS(Re: [PATCH 00/10] Implement batching skb API)
On Fri, 2007-20-07 at 08:18 +0100, Stephen Hemminger wrote: You may see worse performance with batching in the real world when running over WANs. Like TSO, batching will generate back-to-back packet trains that are subject to multi-packet synchronized loss. Has someone done any study on the TSO effect? Doesn't ECN with a RED router help on something like this? I find it surprising that a single flow doing TSO would overwhelm a router's buffer. I actually think the value of batching as far as TCP is concerned is proportional to the number of flows, i.e. the more flows you have, the more batching you will end up doing. And if TCP's fairness is the legend it has been made out to be, then i don't see this as problematic. BTW, something i noticed in regards to GSO when testing batching: for TCP packets slightly above MTU (up to 2K), GSO gives worse performance than non-GSO. It actually has nothing to do with batching; rather, it works the same way with or without the batching changes. Another oddity: looking at the flow rate from a purely packets/second view (I know that's a router-centric view, but i found it strange nevertheless) - you see that as packet size goes up, the pps also goes up. I tried mucking around with nagle etc, but saw no observable changes. Any insight? My expectation was that the pps would stay at least the same or get better with smaller packets (assuming there's less data to push around). cheers, jamal