
[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread jamal
On Tue, 2007-09-10 at 08:39 +0530, Krishna Kumar2 wrote:

 Driver might ask for 10 and we send 10, but LLTX driver might fail to get
 lock and return TX_LOCKED. I haven't seen your code in greater detail, but
 don't you requeue in that case too?

For other drivers that are non-batching and LLTX, it is possible - at
the moment my patch whines that the driver is buggy. I will fix this
up so it checks for NETIF_F_BTX. Thanks for pointing out the above use case.
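
For reference, a rough sketch of the kind of guard being described (this is
not the actual fix; xmit_maybe_batch is a made-up name and requeue handling
is omitted): only hand the whole blist to a driver that advertises
NETIF_F_BTX, otherwise fall back to one skb at a time.

---
/* Sketch only: core-side guard on the batching path.  dev_batch_xmit()
 * and NETIF_F_BTX come from the posted patches; everything else here is
 * illustrative. */
static int xmit_maybe_batch(struct net_device *dev)
{
	struct sk_buff *skb;
	int ret = NETDEV_TX_OK;

	if (dev->features & NETIF_F_BTX)
		return dev_batch_xmit(dev);	/* batching-aware driver */

	/* non-batching (possibly LLTX) driver: one skb at a time */
	while ((skb = __skb_dequeue(&dev->blist)) != NULL) {
		ret = dev->hard_start_xmit(skb, dev);
		if (ret != NETDEV_TX_OK)
			break;			/* requeue handling omitted */
	}
	return ret;
}
---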

cheers,
jamal



[ofa-general] Re: [PATCHES] TX batching

2007-10-09 Thread jamal
On Tue, 2007-09-10 at 13:44 +0530, Krishna Kumar2 wrote:

 My feeling is that since the approaches are very different, 

My concern is that the approaches are different only for short periods of
time. For example, I do requeueing, have xmit_win, have ->end_xmit,
do batching from the core, etc.; if you see value in any of these concepts,
they will appear in your patches, and this goes on in a loop. Perhaps what
we need is a referee, so we can use our energies on something more positive.

 it would be a good idea to test the two for performance. 

Which I don't mind, as long as an analysis goes with it.
If all you post is "here's what netperf showed", it is not useful at all.
There are also a lot of variables that affect the results. For example, is the
receiver a bottleneck? To make it worse, I could demonstrate to you that
if I slowed down the driver and allowed more packets to queue up on the
qdisc, batching would do well. In the past my feeling is you glossed over
such details, and I am a sucker for things like that - hence the conflict.

 Do you mind me doing
 that? Ofcourse others and/or you are more than welcome to do the same.
 
 I had sent a note to you yesterday about this, please let me know
 either way.
 

I responded to you - but it may have been lost in the noise; here's a
copy:
http://marc.info/?l=linux-netdevm=119185137124008w=2

cheers,
jamal



[ofa-general] [PATCHES] TX batching rev2.5

2007-10-09 Thread jamal

Please provide feedback on the code and/or architecture.
They are now updated to work with the latest rebased net-2.6.24
from a few hours ago.
I am in travel mode, so I won't have time to do more testing for the next
few days - I do consider this to be stable at this point based on
what I have been testing (famous last words).

Patch 1: Introduces batching interface
Patch 2: Core uses batching interface
Patch 3: Get rid of dev->gso_skb

What has changed since I posted last:
-------------------------------------

1) There was some cruft left over from the prep_frame feature that I forgot
to remove last time - it is now gone.
2) In the shower this AM, I realized that it is plausible that a
batch of packets sent to the driver may all be dropped because
they are badly formatted. Current drivers all return NETDEV_TX_OK
in all such cases. This would lead to dev->hard_end_xmit()
being invoked unnecessarily. I already had a NETDEV_TX_DROPPED within
the batching drivers, so I made that global and made the batching drivers
return it if they drop packets. The core calls dev->hard_end_xmit()
only when at least one packet makes it through.
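
To make that rule concrete, here is an illustrative sketch (not the patch
itself; dev_batch_xmit_sketch is a made-up name) of a core xmit loop that
only calls dev->hard_end_xmit() when at least one skb was accepted:

---
static int dev_batch_xmit_sketch(struct net_device *dev)
{
	struct sk_buff_head *blist = &dev->blist;
	struct sk_buff *skb;
	int sent = 0, ret = NETDEV_TX_OK;

	while ((skb = __skb_dequeue(blist)) != NULL) {
		ret = dev->hard_start_xmit(skb, dev);
		if (ret == NETDEV_TX_BUSY || ret == NETDEV_TX_LOCKED) {
			/* driver cannot take it right now; put it back */
			__skb_queue_head(blist, skb);
			break;
		}
		if (ret != NETDEV_TX_DROPPED)
			sent++;		/* skb actually made it to the ring */
	}

	/* only kick the DMA engine if something was queued */
	if (sent && dev->hard_end_xmit)
		dev->hard_end_xmit(dev);

	return ret;
}
---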


Things I am gonna say that nobody will see (wink)
--------------------------------------------------
Dave, please let me know if this meets your desire to let devices
which are SG-capable and able to compute checksums benefit, just in case I
misunderstood.
Herbert, if you can look at least at patch 3 I will appreciate it
(since it kills dev->gso_skb, which you introduced).

UPCOMING PATCHES
---
As before:
More patches to follow later if I get some feedback - I didn't want to
overload people by dumping too many patches. Most of the patches
mentioned below are ready to go; some need re-testing and others
need a little porting from an earlier kernel:
- tg3 driver 
- tun driver
- pktgen
- netiron driver
- e1000e driver (non-LLTX)
- ethtool interface
- There is at least one other driver promised to me

There's also a driver howto that I will post later today.

PERFORMANCE TESTING


The system under test is still a 2x dual-core Opteron with a
couple of tg3s.
A test tool generates UDP traffic of different sizes for up to 60
seconds per run, or a total of 30M packets. I have 4 threads, each
running on a specific CPU, which keep all the CPUs as busy as they can
sending packets targeted at a directly connected box's UDP discard
port.
All 4 CPUs target a single tg3 to send. The receiving box has a tc rule
which counts and drops all incoming UDP packets to the discard port - this
allows me to make sure that the receiver is not the bottleneck in the
testing. Packet sizes sent are {8B, 32B, 64B, 128B, 256B, 512B, 1024B}.
Each packet-size run is repeated 10 times to ensure that there are no
transients. The average of all 10 runs is then computed and collected.

I also plan to run forwarding and TCP tests in the future when the
dust settles.

cheers,
jamal




[ofa-general] [PATCH 1/3] [NET_BATCH] Introduce batching interface Rev2.5

2007-10-09 Thread jamal
This patch introduces the netdevice interface for batching.

cheers,
jamal
[NET_BATCH] Introduce batching interface

This patch introduces the netdevice interface for batching.

BACKGROUND
-

A driver's dev->hard_start_xmit() has 4 typical parts:
a) packet formatting (examples: VLAN, MSS, descriptor counting, etc.)
b) chip-specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete the packet transmit: tell the DMA engine to chew on it,
tx completion interrupts, set last tx time, etc.

[For code cleanliness/readability's sake, regardless of this work,
one should break dev->hard_start_xmit() into those 4 functions
anyway].

INTRODUCING API
---

With the API introduced in this patch, a driver which has all
4 parts and needs to support batching is advised to split its
dev->hard_start_xmit() in the following manner:
1) Remove #d from dev->hard_start_xmit() and put it in the
dev->hard_end_xmit() method.
2) #b and #c can stay in ->hard_start_xmit() (or whichever way you want
to do this).
3) #a is deferred to future work to reduce confusion (since it stands
on its own).

Note: There are drivers which may not need to support either of the two
approaches (for example the tun driver I patched), so the methods are
optional.

The xmit_win variable is set by the driver to tell the core how much space
it has to take on new skbs. It is introduced to ensure that when we pass
the driver a list of packets it will swallow all of them - which is
useful because we don't requeue to the qdisc (and this avoids burning
unnecessary CPU cycles or introducing any strange re-ordering). The driver
tells us, when it invokes netif_wake_queue(), how much space it has for
descriptors by setting this variable.

Refer to the driver howto for more details.

THEORY OF OPERATION
---

1. Core dequeues from the qdiscs up to dev->xmit_win packets. Fragmented
and GSO packets are accounted for as well.
2. Core grabs TX_LOCK.
3. Core loops over all skbs and
invokes the driver's dev->hard_start_xmit().
4. Core invokes the driver's dev->hard_end_xmit().
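
As a condensed illustration of steps 1-4 (error paths and requeueing are
omitted; see patch 2 for the real flow in net/sched/sch_generic.c), assuming
the fields this series adds (dev->xmit_win, dev->blist, dev->hard_end_xmit()):

---
static void batch_restart_sketch(struct net_device *dev, struct Qdisc *q)
{
	struct sk_buff *skb;
	int win = dev->xmit_win;

	/* 1. dequeue up to xmit_win worth of packets from the qdisc */
	while (win > 0 && (skb = q->dequeue(q)) != NULL) {
		win -= 1 + skb_shinfo(skb)->nr_frags;	/* frags count too */
		__skb_queue_tail(&dev->blist, skb);
	}

	/* 2. grab the tx lock once for the whole batch */
	HARD_TX_LOCK(dev, smp_processor_id());

	/* 3. hand every skb to the driver */
	while ((skb = __skb_dequeue(&dev->blist)) != NULL)
		dev->hard_start_xmit(skb, dev);

	/* 4. one completion call for the whole batch */
	if (dev->hard_end_xmit)
		dev->hard_end_xmit(dev);

	HARD_TX_UNLOCK(dev);
}
---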

ACKNOWLEDGEMENT AND SOME HISTORY


There's a lot of history and reasoning on why batching in a document
I am writing, which I may submit as a patch.
Thomas Graf (who probably doesn't know this) gave me the impetus to
start looking at this back in 2004 when he invited me to the Linux
conference he was organizing. Parts of what I presented at SUCON in
2004 talk about batching. Herbert Xu forced me to take a second look around
2.6.18 - refer to my netconf 2006 presentation. Krishna Kumar provided
me with more motivation in May 2007 when he posted on netdev and engaged
me.
Sridhar Samudrala, Krishna Kumar, Matt Carlson, Michael Chan,
Jeremy Ethridge, Evgeniy Polyakov, Sivakumar Subramani, David Miller,
Patrick McHardy, Jeff Garzik and Bill Fink have contributed in one or
more of {bug fixes, enhancements, testing, lively discussion}. The
Broadcom and Neterion folks have been outstanding in their help.

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit 98d39ea7922fa2719a80eecd02cae359f3d7
tree 63822bf3040ea41846399c589c912c2be654f008
parent 7b4cd20628fe5c4e145c383fcd8d954d38f7be61
author Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:06:28 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:06:28 -0400

 include/linux/netdevice.h |9 +-
 net/core/dev.c|   67 ++---
 net/sched/sch_generic.c   |4 +--
 3 files changed, 73 insertions(+), 7 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 91cd3f3..b0e71c9 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -86,6 +86,7 @@ struct wireless_dev;
 /* Driver transmit return codes */
 #define NETDEV_TX_OK 0		/* driver took care of packet */
 #define NETDEV_TX_BUSY 1	/* driver tx path was busy*/
+#define NETDEV_TX_DROPPED 2	/* driver tx path dropped packet*/
 #define NETDEV_TX_LOCKED -1	/* driver tx lock was already taken */
 
 /*
@@ -467,6 +468,7 @@ struct net_device
 #define NETIF_F_NETNS_LOCAL	8192	/* Does not change network namespaces */
 #define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 #define NETIF_F_LRO		32768	/* large receive offload */
+#define NETIF_F_BTX		65536	/* Capable of batch tx */
 
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -595,6 +597,9 @@ struct net_device
 	void			*priv;	/* pointer to private data	*/
 	int			(*hard_start_xmit) (struct sk_buff *skb,
 		struct net_device *dev);
+	void			(*hard_end_xmit) (struct net_device *dev);
+	int			xmit_win;
+
 	/* These may be needed for future network-power-down code. */
 	unsigned long		trans_start;	/* Time (in jiffies) of last Tx	*/
 
@@ -609,6 +614,7 @@ struct net_device
 
 	/* delayed register/unregister */
 	struct list_head	todo_list;
+	struct sk_buff_head blist;
 	/* device index hash chain */
 	struct hlist_node	index_hlist;
 
@@ -1043,7 +1049,8

[ofa-general] [PATCH 2/3][NET_BATCH] Rev2.5 net core use batching

2007-10-09 Thread jamal
This patch adds the usage of batching within the core.

cheers,
jamal
[NET_BATCH] net core use batching

This patch adds the usage of batching within the core.
Performance results demonstrating improvement are provided separately.

I have #if-0ed some of the old functions so the patch is more readable.
A future patch will remove all if-0ed content.
Patrick McHardy eyeballed a bug that will cause re-ordering in case
of a requeue.

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit c73d8ee8cce61a98f55fbfb2cafe813a7eca472c
tree 8b9155fe15baa4c2e7adb69585c7aa275a6bc896
parent 98d39ea7922fa2719a80eecd02cae359f3d7
author Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:13:30 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:13:30 -0400

 net/sched/sch_generic.c |  104 ++-
 1 files changed, 94 insertions(+), 10 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 424c08b..d98c680 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -56,6 +56,7 @@ static inline int qdisc_qlen(struct Qdisc *q)
 	return q->q.qlen;
 }
 
+#if 0
 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
   struct Qdisc *q)
 {
@@ -110,6 +111,85 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
 
 	return ret;
 }
+#endif
+
+static inline int handle_dev_cpu_collision(struct net_device *dev)
+{
+	if (unlikely(dev->xmit_lock_owner == smp_processor_id())) {
+		if (net_ratelimit())
+			printk(KERN_WARNING
+			       "Dead loop on netdevice %s, fix it urgently!\n",
+			       dev->name);
+		return 1;
+	}
+	__get_cpu_var(netdev_rx_stat).cpu_collision++;
+	return 0;
+}
+
+static inline int
+dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev,
+	   struct Qdisc *q)
+{
+
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue_tail(skbs)) != NULL)
+		q->ops->requeue(skb, q);
+
+	netif_schedule(dev);
+	return 0;
+}
+
+static inline int
+xmit_islocked(struct sk_buff_head *skbs, struct net_device *dev,
+	struct Qdisc *q)
+{
+	int ret = handle_dev_cpu_collision(dev);
+
+	if (ret) {
+		if (!skb_queue_empty(skbs))
+			skb_queue_purge(skbs);
+		return qdisc_qlen(q);
+	}
+
+	return dev_requeue_skbs(skbs, dev, q);
+}
+
+static int xmit_count_skbs(struct sk_buff *skb)
+{
+	int count = 0;
+	for (; skb; skb = skb->next) {
+		count += skb_shinfo(skb)->nr_frags;
+		count += 1;
+	}
+	return count;
+}
+
+static int xmit_get_pkts(struct net_device *dev,
+			   struct Qdisc *q,
+			   struct sk_buff_head *pktlist)
+{
+	struct sk_buff *skb;
+	int count = dev->xmit_win;
+
+	if (count && dev->gso_skb) {
+		skb = dev->gso_skb;
+		dev->gso_skb = NULL;
+		count -= xmit_count_skbs(skb);
+		__skb_queue_tail(pktlist, skb);
+	}
+
+	while (count > 0) {
+		skb = q->dequeue(q);
+		if (!skb)
+			break;
+
+		count -= xmit_count_skbs(skb);
+		__skb_queue_tail(pktlist, skb);
+	}
+
+	return skb_queue_len(pktlist);
+}
 
 /*
  * NOTE: Called under dev->queue_lock with locally disabled BH.
@@ -133,19 +213,20 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
 static inline int qdisc_restart(struct net_device *dev)
 {
 	struct Qdisc *q = dev->qdisc;
-	struct sk_buff *skb;
-	int ret, xcnt = 0;
+	int ret = 0;
 
-	/* Dequeue packet */
-	if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
-		return 0;
+	/* Dequeue packets */
+	ret = xmit_get_pkts(dev, q, dev->blist);
 
+	if (!ret)
+		return 0;
 
-	/* And release queue */
+	/* We got em packets */
 	spin_unlock(dev->queue_lock);
 
+	/* bye packets */
 	HARD_TX_LOCK(dev, smp_processor_id());
-	ret = dev_hard_start_xmit(skb, dev, xcnt);
+	ret = dev_batch_xmit(dev);
 	HARD_TX_UNLOCK(dev);
 
 	spin_lock(dev->queue_lock);
@@ -158,8 +239,8 @@ static inline int qdisc_restart(struct net_device *dev)
 		break;
 
 	case NETDEV_TX_LOCKED:
-		/* Driver try lock failed */
-		ret = handle_dev_cpu_collision(skb, dev, q);
+		/* Driver lock failed */
+		ret = xmit_islocked(dev->blist, dev, q);
 		break;
 
 	default:
@@ -168,7 +249,7 @@ static inline int qdisc_restart(struct net_device *dev)
 			printk(KERN_WARNING "BUG %s code %d qlen %d\n",
 			       dev->name, ret, q->q.qlen);
 
-		ret = dev_requeue_skb(skb, dev, q);
+		ret = dev_requeue_skbs(dev->blist, dev, q);
 		break;
 	}
 
@@ -564,6 +645,9 @@ void dev_deactivate(struct net_device *dev)
 
 	skb = dev->gso_skb;
 	dev->gso_skb = NULL;
+	if (!skb_queue_empty(dev->blist))
+		skb_queue_purge(dev->blist);
+	dev->xmit_win = 1;
 	spin_unlock_bh(dev->queue_lock);
 
 	kfree_skb(skb);

[ofa-general] [PATCH 3/3][NET_BATCH] Rev2.5 kill dev-gso_skb

2007-10-09 Thread jamal
This patch removes dev->gso_skb, as it is no longer necessary with the
batching code.

cheers,
jamal

[NET_BATCH] kill dev->gso_skb

The batching code does what GSO used to batch at the drivers.
There is no more need for gso_skb. If for whatever reason the
requeueing turns out to be a bad idea, we are going to leave packets in
dev->blist (and still not need dev->gso_skb).

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit fac8a4147548f314d4edb74634e78e5b06e0e135
tree 72114acb327bc7e3eb219275df6b3aab7459795c
parent c73d8ee8cce61a98f55fbfb2cafe813a7eca472c
author Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:22:43 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:22:43 -0400

 include/linux/netdevice.h |3 ---
 net/sched/sch_generic.c   |   12 
 2 files changed, 0 insertions(+), 15 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b0e71c9..7592a56 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -578,9 +578,6 @@ struct net_device
 	struct list_head	qdisc_list;
 	unsigned long		tx_queue_len;	/* Max frames per queue allowed */
 
-	/* Partially transmitted GSO packet. */
-	struct sk_buff		*gso_skb;
-
 	/* ingress path synchronizer */
 	spinlock_t		ingress_lock;
 	struct Qdisc		*qdisc_ingress;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index d98c680..36b6972 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -172,13 +172,6 @@ static int xmit_get_pkts(struct net_device *dev,
 	struct sk_buff *skb;
 	int count = dev->xmit_win;
 
-	if (count && dev->gso_skb) {
-		skb = dev->gso_skb;
-		dev->gso_skb = NULL;
-		count -= xmit_count_skbs(skb);
-		__skb_queue_tail(pktlist, skb);
-	}
-
 	while (count > 0) {
 		skb = q->dequeue(q);
 		if (!skb)
@@ -635,7 +628,6 @@ void dev_activate(struct net_device *dev)
 void dev_deactivate(struct net_device *dev)
 {
 	struct Qdisc *qdisc;
-	struct sk_buff *skb;
 
 	spin_lock_bh(dev->queue_lock);
 	qdisc = dev->qdisc;
@@ -643,15 +635,11 @@ void dev_deactivate(struct net_device *dev)
 
 	qdisc_reset(qdisc);
 
-	skb = dev->gso_skb;
-	dev->gso_skb = NULL;
 	if (!skb_queue_empty(dev->blist))
 		skb_queue_purge(dev->blist);
 	dev->xmit_win = 1;
 	spin_unlock_bh(dev->queue_lock);
 
-	kfree_skb(skb);
-
 	dev_watchdog_down(dev);
 
 	/* Wait for outstanding dev_queue_xmit calls. */

[ofa-general] Re: [PATCHES] TX batching rev2.5

2007-10-09 Thread jamal
On Tue, 2007-09-10 at 18:07 -0400, jamal wrote:
 Please provide feedback on the code and/or architecture.
 They are now updated to work with the latest rebased net-2.6.24 
 from a few hours ago.

I should have added that I have tested this with just the batching changes
and it is within the performance realm of the changes from yesterday. If
anyone wants exact numbers, I can send them.

cheers,
jamal



[ofa-general] [DOC][NET_BATCH]Rev2.5 Driver Howto

2007-10-09 Thread jamal

I updated this doc to match the latest patch.

cheers,
jamal
Here's the beginning of a howto for driver authors.

The intended audience for this howto is people already
familiar with netdevices.

1.0  Netdevice Prerequisites
--

For hardware-based netdevices, you must have at least hardware that 
is capable of doing DMA with many descriptors; i.e., having hardware 
with a queue length of 3 (as in some fscked ethernet hardware) is 
not very useful in this case.

2.0  What is new in the driver API
---

There is one new method and one new variable that the
driver author needs to be aware of. These are:
1) dev->hard_end_xmit()
2) dev->xmit_win

2.1 Using Core driver changes
-

To provide context, let's look at a typical driver abstraction
for dev->hard_start_xmit(). It has 4 parts:
a) packet formatting (example: vlan, mss, descriptor counting, etc.)
b) chip-specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit, tell DMA engine to chew 
on, tx completion interrupts, etc.

[For code cleanliness/readability's sake, regardless of this work,
one should break dev->hard_start_xmit() into those 4 functional
blocks anyway].

A driver which has all 4 parts and needs to support batching is
advised to split its dev->hard_start_xmit() in the following manner:

1) use its dev->hard_end_xmit() method to achieve #d
2) use dev->xmit_win to tell the core how much space you have.

#b and #c can stay in ->hard_start_xmit() (or whichever way you
want to do this).
Section 3 shows more details on the suggested usage.

2.1.1 Theory of operation
--

1. Core dequeues from the qdiscs up to dev->xmit_win packets. Fragmented
and GSO packets are accounted for as well.
2. Core grabs the device's TX_LOCK.
3. Core loops over all skbs:
 - invokes the driver's dev->hard_start_xmit()
4. Core invokes the driver's dev->hard_end_xmit() if packets were transmitted.

2.1.1.1 The slippery LLTX
-

Since these types of drivers are being phased out and they require extra
code, they will not be supported anymore. So as of Oct 2007 the code that
supported them has been removed.

2.1.1.2 xmit_win


The dev->xmit_win variable is set by the driver to tell us how
much space it has in its rings/queues. This detail is then
used to figure out how many packets are retrieved from the qdisc
queues (in order to send to the driver).
dev->xmit_win is introduced to ensure that when we pass the driver
a list of packets it will swallow all of them -- which is useful
because we don't requeue to the qdisc (and this avoids burning unnecessary
CPU cycles or introducing any strange re-ordering).
Essentially the driver signals us how much space it has for descriptors
by setting this variable.

2.1.1.2.1 Setting xmit_win
--

This variable should be set during xmit path shutdown (netif_stop),
wakeup (netif_wake) and ->hard_end_xmit(). In the case of the first,
the value is set to 1; in the other two it is set to whatever
the driver deems to be the available space on the ring.
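
To make the three cases concrete, here are hypothetical driver fragments
(my_stop_queue, my_tx_complete and MY_TX_WAKE_THRESH are made-up names;
real drivers use their own ring accounting):

---
#define MY_TX_WAKE_THRESH 32	/* made-up wake threshold */

/* (1) xmit path shutdown: no room left, advertise the minimum */
static void my_stop_queue(struct net_device *dev)
{
	dev->xmit_win = 1;
	netif_stop_queue(dev);
}

/* (2) wakeup and (3) ->hard_end_xmit(): advertise the real free space */
static void my_tx_complete(struct net_device *dev, int free_descriptors)
{
	if (netif_queue_stopped(dev) &&
	    free_descriptors >= MY_TX_WAKE_THRESH) {
		dev->xmit_win = free_descriptors;
		netif_wake_queue(dev);
	}
}
---

Section 3.1 below shows how the real tun, e1000 and tg3 code does this.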

3.0 Driver Essentials
-

The typical driver tx state machine is:


-1- +Core sends packets
 +-- Driver puts packet onto hardware queue
 +if hardware queue is full, netif_stop_queue(dev)
 +
-2- +core stops sending because of netif_stop_queue(dev)
..
.. time passes ...
..
-3- +--- driver has transmitted packets, opens up tx path by
  invoking netif_wake_queue(dev)
-1- +Cycle repeats and core sends more packets (step 1).


3.1  Driver prerequisite
--

This is _a very important_ requirement in making batching useful.
The prerequisite for batching changes is that the driver should 
provide a low threshold to open up the tx path.
Drivers such as tg3 and e1000 already do this.
Before you invoke netif_wake_queue(dev), you check whether a
threshold of available space has been reached for inserting new packets.

Here's an example of how I added it to the tun driver. Observe the
setting of dev->xmit_win.

---
+#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
..
..
u32 t = skb_queue_len(&tun->readq);
if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
	tun->dev->xmit_win = tun->dev->tx_queue_len;
	netif_wake_queue(tun->dev);
}
---

Here's how the batching e1000 driver does it:

--
if (unlikely(cleaned && netif_carrier_ok(netdev) &&
	     E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD)) {

	if (netif_queue_stopped(netdev)) {
		int rspace = E1000_DESC_UNUSED(tx_ring) - (MAX_SKB_FRAGS + 2);
		netdev->xmit_win = rspace;
		netif_wake_queue(netdev);
	}
---

The tg3 code (with no batching changes) looks like:

-
if (netif_queue_stopped(tp->dev) &&
    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))
	netif_wake_queue(tp->dev);

[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread jamal
On Mon, 2007-08-10 at 10:33 +0530, Krishna Kumar2 wrote:

  As a side note: Any batching driver should _never_ have to requeue; if
  it does it is buggy. And the non-batching ones if they ever requeue will
  be a single packet, so not much reordering.
 
 On the contrary, batching LLTX drivers (if that is not ruled out) will very
 often requeue resulting in heavy reordering. Fix looks good though.

Two things:
one, LLTX is deprecated (I think I saw a patch which says no more new
drivers should do LLTX) and I plan, if nobody else does, to kill
LLTX in e1000 RSN. So for that reason I removed all code that existed to
support LLTX.
two, there should _never_ be any requeueing, even with LLTX, in the previous
patches where I supported them; if there is, it is a bug. This is because
we don't send more than what the driver asked for via xmit_win. So if it
asked for more than it can handle, that is a bug. If its available space
changes while we are sending to it, that too is a bug.

cheers,
jamal



[ofa-general] Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock

2007-10-08 Thread jamal
On Sun, 2007-07-10 at 21:51 -0700, David Miller wrote:

 For these high performance 10Gbit cards it's a load balancing
 function, really, as all of the transmit queues go out to the same
 physical port so you could:
 
 1) Load balance on CPU number.
 2) Load balance on flow
 3) Load balance on destination MAC
 
 etc. etc. etc.

The brain-block I am having is the parallelization aspect of it.
Whatever scheme it is, it needs to ensure the scheduler works as
expected. For example, if it were a strict prio scheduler, I would expect
that whatever goes out is always high priority first, and a low-prio
packet is never let out at any time there is something high-prio
needing to go out. If I have the two priorities running on two CPUs,
then I can't guarantee that effect.
IOW, I see the scheduler/qdisc level as not being split across parallel
CPUs. Do I make any sense?

The rest of my understanding hinges on the above, so let me stop here.

cheers,
jamal



[ofa-general] Re: [PATCH 1/3] [NET_BATCH] Introduce batching interface

2007-10-08 Thread jamal
On Mon, 2007-08-10 at 15:29 +0530, Krishna Kumar2 wrote:
 Hi Jamal,
 
 If you don't mind, I am trying to run your approach vs mine to get some
 results for comparison.

Please provide an analysis when you get the results. IOW, explain why
one vs. the other gets different results.

 For starters, I am having issues with iperf when using your infrastructure
 code with
 my IPoIB driver - about 100MB is sent and then everything stops for some
 reason.

I haven't tested with iperf in a while.
Can you post the netstat output on both sides when the driver stops?
It does sound like a driver issue to me.

 The changes in the IPoIB driver that I made to support batching is to set
 BTX, set
 xmit_win, and dynamically reduce xmit_win on every xmit 
 and increase xmit_win on every xmit completion. 

From driver howto:
---
This variable should be set during xmit path shutdown(netif_stop),
wakeup(netif_wake) and ->hard_end_xmit(). In the case of the first
one the value is set to 1 and in the other two it is set to whatever
the driver deems to be available space on the ring.


 Is there anything else that is required from the
 driver?

Your driver also needs to support wake thresholding.
I will post the driver howto later today.

cheers,
jamal



[ofa-general] Re: [PATCHES] TX batching

2007-10-08 Thread jamal
On Mon, 2007-08-10 at 16:51 +0400, Evgeniy Polyakov wrote:

 it looks like you and Krishna use the same requeueing methods - get one
 from qdisk, queue it into blist, get next from qdisk, queue it,
 eventually start transmit, where you dequeue it one-by-one and send (or
 prepare and commit). This is not the 100% optimal approach, but if you
 proved it does not hurt usual network processing, it is ok.

There are probably other bottlenecks that hide the need to optimize
further.

 Number of comments dusted to very small - that's a sign, but I'm a bit
 lost - did you and Krishna create the competing approaches, or they can
 co-exist together, in the former case I doubt you can push, until all
 problematic places are resolved, in the latter case, this is probably
 ready.

Thanks. I would like to make one more cleanup and get rid of the
temporary pkt list in qdisc restart; now that I have deferred the skb
pre-format interface, it is unnecessary. I have a day off today, so I
will make the changes, re-run the tests and post again.

I don't see anything in Krishna's approach that I can take and reuse.
This may be because my old approaches have evolved along the same path.
There is a long list, but as a sample: I used to do a lot more work while
holding the queue lock, which I have now moved past the queue lock; I don't
have any special interfaces/tricks just for batching; I provide hints
to the core about how much the driver can take, etc. I have offered
Krishna co-authorship if he makes the IPoIB driver work on my
patches; that offer still stands if he chooses to take it.

cheers,
jamal



[ofa-general] Re: parallel networking (was Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock)

2007-10-08 Thread jamal
On Mon, 2007-08-10 at 10:22 -0400, Jeff Garzik wrote:

 Any chance the NIC hardware could provide that guarantee?

If you can get the scheduling/dequeuing to run on one CPU (as we do
today) it should work; alternatively you can totally bypass the qdisc
subsystem and go directly to the hardware for devices that are capable,
and that would work but would require huge changes.
My fear is that there are mini-scheduler pieces running on multiple CPUs,
which is what I understood as being described.

 8139cp, for example, has two TX DMA rings, with hardcoded 
 characteristics:  one is a high prio q, and one a low prio q.  The logic 
 is pretty simple:   empty the high prio q first (potentially starving 
 low prio q, in worst case).

Sounds like strict prio scheduling to me, which says: if low prio starves,
so be it.

 In terms of overall parallelization, both for TX as well as RX, my gut 
 feeling is that we want to move towards an MSI-X, multi-core friendly 
 model where packets are LIKELY to be sent and received by the same set 
 of [cpus | cores | packages | nodes] that the [userland] processes 
 dealing with the data.

Does putting things on the same core help? Overall, though, I agree with your
views.

 There are already some primitive NUMA bits in skbuff allocation, but 
 with modern MSI-X and RX/TX flow hashing we could do a whole lot more, 
 along the lines of better CPU scheduling decisions, directing flows to 
 clusters of cpus, and generally doing a better job of maximizing cache 
 efficiency in a modern multi-thread environment.

I think I see the receive side with a lot of clarity; I am still foggy on the
transmit path, mostly because of the QoS/scheduling issues.

 IMO the current model where each NIC's TX completion and RX processes 
 are both locked to the same CPU is outmoded in a multi-core world with 
 modern NICs.  :)

In fact, even with the status quo there is a case to be made for not binding
to interrupts.
In my recent experience with batching, due to the nature of my test app,
I benefit if I let the interrupts float across multiple CPUs.
My app runs/binds a thread per CPU and so benefits from having more
juice to send more packets per unit of time - something I wouldn't get if
I was always running on one CPU.
But when I do this I found that just because I have bound a thread to
cpu3 doesn't mean that thread will always run on cpu3. If the netif wakeup
happens on cpu1, the scheduler will put the thread on cpu1 if it is to be
run. It made sense to do that; it just took me a while to digest.

 But I readily admit general ignorance about the kernel process 
 scheduling stuff, so my only idea about a starting point was to see how 
 far to go with the concept of skb affinity -- a mask in sk_buff that 
 is a hint about which cpu(s) on which the NIC should attempt to send and 
 receive packets.  When going through bonding or netfilter, it is trivial 
 to 'or' together affinity masks.  All the various layers of net stack 
 should attempt to honor the skb affinity, where feasible (requires 
 interaction with CFS scheduler?).

There would be cache benefits if you could free the packet on the same CPU
it was allocated on; so the idea of skb affinity is useful, at a minimum
in that sense, if you can pull it off. Assuming the hardware is capable, even
if you just tagged the skb on xmit to say which CPU it was sent out on, and
made sure that is where it is freed, that would be a good start.

Note: The majority of the packet-processing overhead is _still_ the
memory subsystem latency; in my tests with batched pktgen, improving the
xmit subsystem meant the overhead of allocating and freeing the packets
went to something > 80%.
So something along the lines of parallelizing, based on splitting the
alloc/free of skbs across more CPUs than where xmit/receive run, would IMO
see more performance improvements.

 Or maybe skb affinity is a dumb idea.  I wanted to get people thinking 
 on the bigger picture.  Parallelization starts at the user process.


cheers,
jamal



[ofa-general] [DOC][NET_BATCH] Driver howto

2007-10-08 Thread jamal
This is an updated driver howto for batching that works with the patches
from yesterday and the revised ones I am going to post.

cheers,
jamal
Here's the beginning of a howto for driver authors.

The intended audience for this howto is people already
familiar with netdevices.

1.0  Netdevice Prerequisites
--

For hardware-based netdevices, you must have at least hardware that 
is capable of doing DMA with many descriptors; i.e., having hardware 
with a queue length of 3 (as in some fscked ethernet hardware) is 
not very useful in this case.

2.0  What is new in the driver API
---

There is one new method and one new variable that the
driver author needs to be aware of. These are:
1) dev->hard_end_xmit()
2) dev->xmit_win

2.1 Using Core driver changes
-

To provide context, let's look at a typical driver abstraction
for dev->hard_start_xmit(). It has 4 parts:
a) packet formatting (example: vlan, mss, descriptor counting, etc.)
b) chip-specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit, tell DMA engine to chew 
on, tx completion interrupts, etc.

[For code cleanliness/readability's sake, regardless of this work,
one should break dev->hard_start_xmit() into those 4 functional
blocks anyway].

A driver which has all 4 parts and needs to support batching is
advised to split its dev->hard_start_xmit() in the following manner:

1) use its dev->hard_end_xmit() method to achieve #d
2) use dev->xmit_win to tell the core how much space you have.

#b and #c can stay in ->hard_start_xmit() (or whichever way you
want to do this).
Section 3 shows more details on the suggested usage.

2.1.1 Theory of operation
--

1. Core dequeues from the qdiscs up to dev->xmit_win packets. Fragmented
and GSO packets are accounted for as well.
2. Core grabs the device's TX_LOCK.
3. Core loops over all skbs:
 - invokes the driver's dev->hard_start_xmit()
4. Core invokes the driver's dev->hard_end_xmit() if packets were transmitted.

2.1.1.1 The slippery LLTX
-

Since these types of drivers are being phased out and they require extra
code, they will not be supported anymore. So as of Oct 2007 the code that
supported them has been removed.

2.1.1.2 xmit_win


The dev->xmit_win variable is set by the driver to tell us how
much space it has in its rings/queues. This detail is then
used to figure out how many packets are retrieved from the qdisc
queues (in order to send to the driver).
dev->xmit_win is introduced to ensure that when we pass the driver
a list of packets it will swallow all of them -- which is useful
because we don't requeue to the qdisc (and this avoids burning unnecessary
CPU cycles or introducing any strange re-ordering).
Essentially the driver signals us how much space it has for descriptors
by setting this variable.

2.1.1.2.1 Setting xmit_win
--

This variable should be set during xmit path shutdown (netif_stop),
wakeup (netif_wake) and ->hard_end_xmit(). In the case of the first,
the value is set to 1; in the other two it is set to whatever
the driver deems to be the available space on the ring.

3.0 Driver Essentials
-

The typical driver tx state machine is:


-1- +Core sends packets
 +-- Driver puts packet onto hardware queue
 +if hardware queue is full, netif_stop_queue(dev)
 +
-2- +core stops sending because of netif_stop_queue(dev)
..
.. time passes ...
..
-3- +--- driver has transmitted packets, opens up tx path by
  invoking netif_wake_queue(dev)
-1- +Cycle repeats and core sends more packets (step 1).


3.1  Driver prerequisite
--

This is _a very important_ requirement in making batching useful.
The prerequisite for batching changes is that the driver should 
provide a low threshold to open up the tx path.
Drivers such as tg3 and e1000 already do this.
Before you invoke netif_wake_queue(dev), you check whether a
threshold of available space has been reached for inserting new packets.

Here's an example of how I added it to the tun driver. Observe the
setting of dev->xmit_win.

---
+#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
..
..
u32 t = skb_queue_len(&tun->readq);
if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
	tun->dev->xmit_win = tun->dev->tx_queue_len;
	netif_wake_queue(tun->dev);
}
---

Here's how the batching e1000 driver does it:

--
if (unlikely(cleaned && netif_carrier_ok(netdev) &&
	     E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD)) {

	if (netif_queue_stopped(netdev)) {
		int rspace = E1000_DESC_UNUSED(tx_ring) - (MAX_SKB_FRAGS + 2);
		netdev->xmit_win = rspace;
		netif_wake_queue(netdev);
	}
---

The tg3 code (with no batching changes) looks like:

-
if (netif_queue_stopped(tp->dev) &&
    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))
	netif_wake_queue(tp->dev);

[ofa-general] [PATCHES] TX batching rev2

2007-10-08 Thread jamal

Please provide feedback on the code and/or architecture.
Last time I posted them I received little. They are now updated to
work with the latest net-2.6.24 from a few hours ago.

Patch 1: Introduces batching interface
Patch 2: Core uses batching interface
Patch 3: Get rid of dev->gso_skb

What has changed since I posted last:
Killed the temporary packet list that is passed to qdisc restart.

Dave, please let me know if this meets your desire to let devices
which are SG-capable and able to compute checksums benefit, just in case I
misunderstood.
Herbert, if you can look at least at patch 3 I will appreciate it
(since it kills dev->gso_skb, which you introduced).

UPCOMING PATCHES
---
As before:
More patches to follow later if I get some feedback - I didn't want to
overload people by dumping too many patches. Most of the patches
mentioned below are ready to go; some need re-testing and others
need a little porting from an earlier kernel:
- tg3 driver 
- tun driver
- pktgen
- netiron driver
- e1000 driver (LLTX)
- e1000e driver (non-LLTX)
- ethtool interface
- There is at least one other driver promised to me

There's also a driver howto I wrote that was posted on netdev last week,
as well as one that describes the architectural decisions made.

PERFORMANCE TESTING


The system under test is still a 2x dual-core Opteron with a
couple of tg3s.
A test tool generates UDP traffic of different sizes for up to 60
seconds per run, or a total of 30M packets. I have 4 threads, each
running on a specific CPU, which keep all the CPUs as busy as they can
sending packets targeted at a directly connected box's UDP discard
port.
All 4 CPUs target a single tg3 to send. The receiving box has a tc rule
which counts and drops all incoming UDP packets to the discard port - this
allows me to make sure that the receiver is not the bottleneck in the
testing. Packet sizes sent are {8B, 32B, 64B, 128B, 256B, 512B, 1024B}.
Each packet-size run is repeated 10 times to ensure that there are no
transients. The average of all 10 runs is then computed and collected.

I also plan to run forwarding and TCP tests in the future when the
dust settles.

cheers,
jamal



[ofa-general] [PATCH 1/3] [NET_BATCH] Introduce batching interface

2007-10-08 Thread jamal
This patch introduces the netdevice interface for batching.

cheers,
jamal

[NET_BATCH] Introduce batching interface

This patch introduces the netdevice interface for batching.

BACKGROUND
-

A driver's dev->hard_start_xmit() has 4 typical parts:
a) packet formatting (examples: VLAN, MSS, descriptor counting, etc.)
b) chip-specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete the packet transmit: tell the DMA engine to chew on it,
tx completion interrupts, set last tx time, etc.

[For code cleanliness/readability's sake, regardless of this work,
one should break dev->hard_start_xmit() into those 4 functions
anyway].

INTRODUCING API
---

With the API introduced in this patch, a driver which has all
4 parts and needs to support batching is advised to split its
dev->hard_start_xmit() in the following manner:
1) Remove #d from dev->hard_start_xmit() and put it in the
dev->hard_end_xmit() method.
2) #b and #c can stay in ->hard_start_xmit() (or whichever way you want
to do this).
3) #a is deferred to future work to reduce confusion (since it stands
on its own).

Note: There are drivers which may not need to support either of the two
approaches (for example the tun driver I patched), so the methods are
optional.

The xmit_win variable is set by the driver to tell the core how much space
it has to take on new skbs. It is introduced to ensure that when we pass
the driver a list of packets it will swallow all of them - which is
useful because we don't requeue to the qdisc (and this avoids burning
unnecessary CPU cycles or introducing any strange re-ordering). The driver
tells us, when it invokes netif_wake_queue(), how much space it has for
descriptors by setting this variable.

Refer to the driver howto for more details.

THEORY OF OPERATION
---

1. Core dequeues from the qdiscs up to dev->xmit_win packets. Fragmented
and GSO packets are accounted for as well.
2. Core grabs TX_LOCK.
3. Core loops over all skbs and
invokes the driver's dev->hard_start_xmit().
4. Core invokes the driver's dev->hard_end_xmit().

ACKNOWLEDGEMENT AND SOME HISTORY


There's a lot of history and reasoning on why batching in a document
I am writing, which I may submit as a patch.
Thomas Graf (who probably doesn't know this) gave me the impetus to
start looking at this back in 2004 when he invited me to the Linux
conference he was organizing. Parts of what I presented at SUCON in
2004 talk about batching. Herbert Xu forced me to take a second look around
2.6.18 - refer to my netconf 2006 presentation. Krishna Kumar provided
me with more motivation in May 2007 when he posted on netdev and engaged
me.
Sridhar Samudrala, Krishna Kumar, Matt Carlson, Michael Chan,
Jeremy Ethridge, Evgeniy Polyakov, Sivakumar Subramani, David Miller,
Patrick McHardy, Jeff Garzik and Bill Fink have contributed in one or
more of {bug fixes, enhancements, testing, lively discussion}. The
Broadcom and Neterion folks have been outstanding in their help.

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit 8b9960cfbb98d48c0ac30b2c00d27c25217adb26
tree d8ab164757cee431f71a6011df4b09e0c382e2da
parent cef811600354e9f41f059570fc877d19a1daed99
author Jamal Hadi Salim [EMAIL PROTECTED] Mon, 08 Oct 2007 09:03:42 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Mon, 08 Oct 2007 09:03:42 -0400

 include/linux/netdevice.h |   11 +++
 net/core/dev.c|   40 
 2 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 91cd3f3..b31df5c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -467,6 +467,7 @@ struct net_device
 #define NETIF_F_NETNS_LOCAL	8192	/* Does not change network namespaces */
 #define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 #define NETIF_F_LRO		32768	/* large receive offload */
+#define NETIF_F_BTX		65536	/* Capable of batch tx */
 
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -595,6 +596,9 @@ struct net_device
 	void			*priv;	/* pointer to private data	*/
 	int			(*hard_start_xmit) (struct sk_buff *skb,
 		struct net_device *dev);
+	void			(*hard_end_xmit) (struct net_device *dev);
+	int			xmit_win;
+
 	/* These may be needed for future network-power-down code. */
 	unsigned long		trans_start;	/* Time (in jiffies) of last Tx	*/
 
@@ -609,6 +613,7 @@ struct net_device
 
 	/* delayed register/unregister */
 	struct list_head	todo_list;
+	struct sk_buff_head blist;
 	/* device index hash chain */
 	struct hlist_node	index_hlist;
 
@@ -1044,6 +1049,12 @@ extern int		dev_set_mac_address(struct net_device *,
 	struct sockaddr *);
 extern int		dev_hard_start_xmit(struct sk_buff *skb,
 	struct net_device *dev);
+extern int		dev_batch_xmit(struct net_device *dev);
+extern int		prepare_gso_skb(struct sk_buff *skb,
+	struct net_device *dev,
+	struct sk_buff_head *skbs);
+extern int

[ofa-general] [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread jamal
This patch adds the usage of batching within the core.

cheers,
jamal

[NET_BATCH] net core use batching

This patch adds the usage of batching within the core.
Performance results demonstrating improvement are provided separately.

I have #if-0ed some of the old functions so the patch is more readable.
A future patch will remove all if-0ed content.
Patrick McHardy eyeballed a bug that will cause re-ordering in case
of a requeue.

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit 63381156b35719a364d0f81fec487e6263fb5467
tree 615dfb5f8617a8a4edffbb965e81a75c60597319
parent 8b9960cfbb98d48c0ac30b2c00d27c25217adb26
author Jamal Hadi Salim [EMAIL PROTECTED] Mon, 08 Oct 2007 09:10:14 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Mon, 08 Oct 2007 09:10:14 -0400

 net/sched/sch_generic.c |  116 +++
 1 files changed, 106 insertions(+), 10 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 95ae119..96bfdcb 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -56,6 +56,7 @@ static inline int qdisc_qlen(struct Qdisc *q)
 	return q->q.qlen;
 }
 
+#if 0
 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
   struct Qdisc *q)
 {
@@ -110,6 +111,97 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
 
 	return ret;
 }
+#endif
+
+static inline int handle_dev_cpu_collision(struct net_device *dev)
+{
+	if (unlikely(dev->xmit_lock_owner == smp_processor_id())) {
+		if (net_ratelimit())
+			printk(KERN_WARNING
+			       "Dead loop on netdevice %s, fix it urgently!\n",
+			       dev->name);
+		return 1;
+	}
+	__get_cpu_var(netdev_rx_stat).cpu_collision++;
+	return 0;
+}
+
+static inline int
+dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev,
+	   struct Qdisc *q)
+{
+
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue_tail(skbs)) != NULL)
+		q->ops->requeue(skb, q);
+
+	netif_schedule(dev);
+	return 0;
+}
+
+static inline int
+xmit_islocked(struct sk_buff_head *skbs, struct net_device *dev,
+	struct Qdisc *q)
+{
+	int ret = handle_dev_cpu_collision(dev);
+
+	if (ret) {
+		if (!skb_queue_empty(skbs))
+			skb_queue_purge(skbs);
+		return qdisc_qlen(q);
+	}
+
+	return dev_requeue_skbs(skbs, dev, q);
+}
+
+static int xmit_count_skbs(struct sk_buff *skb)
+{
+	int count = 0;
+	for (; skb; skb = skb->next) {
+		count += skb_shinfo(skb)->nr_frags;
+		count += 1;
+	}
+	return count;
+}
+
+static int xmit_get_pkts(struct net_device *dev,
+			   struct Qdisc *q,
+			   struct sk_buff_head *pktlist)
+{
+	struct sk_buff *skb;
+	int count = dev->xmit_win;
+
+	if (count && dev->gso_skb) {
+		skb = dev->gso_skb;
+		dev->gso_skb = NULL;
+		count -= xmit_count_skbs(skb);
+		__skb_queue_tail(pktlist, skb);
+	}
+
+	while (count > 0) {
+		skb = q->dequeue(q);
+		if (!skb)
+			break;
+
+		count -= xmit_count_skbs(skb);
+		__skb_queue_tail(pktlist, skb);
+	}
+
+	return skb_queue_len(pktlist);
+}
+
+static int xmit_prepare_pkts(struct net_device *dev,
+			 struct sk_buff_head *tlist)
+{
+	struct sk_buff *skb;
+	struct sk_buff_head *flist = dev->blist;
+
+	while ((skb = __skb_dequeue(tlist)) != NULL)
+		xmit_prepare_skb(skb, dev);
+
+	return skb_queue_len(flist);
+}
 
 /*
  * NOTE: Called under dev->queue_lock with locally disabled BH.
@@ -130,22 +222,23 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
  *0 - queue is not empty.
  *
  */
+
 static inline int qdisc_restart(struct net_device *dev)
 {
 	struct Qdisc *q = dev->qdisc;
-	struct sk_buff *skb;
-	int ret;
+	int ret = 0;
 
-	/* Dequeue packet */
-	if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
-		return 0;
+	ret = xmit_get_pkts(dev, q, dev->blist);
 
+	if (!ret)
+		return 0;
 
-	/* And release queue */
+	/* We got em packets */
 	spin_unlock(dev->queue_lock);
 
+	/* bye packets */
 	HARD_TX_LOCK(dev, smp_processor_id());
-	ret = dev_hard_start_xmit(skb, dev);
+	ret = dev_batch_xmit(dev);
 	HARD_TX_UNLOCK(dev);
 
 	spin_lock(dev->queue_lock);
@@ -158,8 +251,8 @@ static inline int qdisc_restart(struct net_device *dev)
 		break;
 
 	case NETDEV_TX_LOCKED:
-		/* Driver try lock failed */
-		ret = handle_dev_cpu_collision(skb, dev, q);
+		/* Driver lock failed */
+		ret = xmit_islocked(dev->blist, dev, q);
 		break;
 
 	default:
@@ -168,7 +261,7 @@ static inline int qdisc_restart(struct net_device *dev)
 			printk(KERN_WARNING "BUG %s code %d qlen %d\n",
 			       dev->name, ret, q->q.qlen);
 
-		ret = dev_requeue_skb(skb, dev, q);
+		ret = dev_requeue_skbs(dev->blist, dev, q);
 		break;
 	}
 
@@ -564,6 +657,9 @@ void dev_deactivate(struct net_device *dev)
 
 	skb = dev->gso_skb;
 	dev->gso_skb = NULL;
+	if (!skb_queue_empty(dev->blist))
+		skb_queue_purge(dev->blist);
+	dev->xmit_win = 1;
 	spin_unlock_bh(dev->queue_lock);
 
 	kfree_skb(skb);

[ofa-general] [PATCH 3/3][NET_BATCH] kill dev-gso_skb

2007-10-08 Thread jamal
This patch removes dev->gso_skb, as it is no longer necessary with the
batching code.

cheers,
jamal

[NET_BATCH] kill dev->gso_skb

The batching code does what GSO used to batch at the drivers.
There is no more need for gso_skb. If for whatever reason the
requeueing turns out to be a bad idea, we are going to leave packets in
dev->blist (and still not need dev->gso_skb).

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit 3a6202d62adff75b85d6ca0f8fd491abf9d63f4b
tree 80497c538bdb3eab6ab81ff1dec1c3263da79826
parent 63381156b35719a364d0f81fec487e6263fb5467
author Jamal Hadi Salim [EMAIL PROTECTED] Mon, 08 Oct 2007 09:11:02 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Mon, 08 Oct 2007 09:11:02 -0400

 include/linux/netdevice.h |3 ---
 net/sched/sch_generic.c   |   12 
 2 files changed, 0 insertions(+), 15 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b31df5c..4ddc6eb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -577,9 +577,6 @@ struct net_device
 	struct list_head	qdisc_list;
 	unsigned long		tx_queue_len;	/* Max frames per queue allowed */
 
-	/* Partially transmitted GSO packet. */
-	struct sk_buff		*gso_skb;
-
 	/* ingress path synchronizer */
 	spinlock_t		ingress_lock;
 	struct Qdisc		*qdisc_ingress;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 96bfdcb..9112ea0 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -172,13 +172,6 @@ static int xmit_get_pkts(struct net_device *dev,
 	struct sk_buff *skb;
 	int count = dev->xmit_win;
 
-	if (count && dev->gso_skb) {
-		skb = dev->gso_skb;
-		dev->gso_skb = NULL;
-		count -= xmit_count_skbs(skb);
-		__skb_queue_tail(pktlist, skb);
-	}
-
 	while (count > 0) {
 		skb = q->dequeue(q);
 		if (!skb)
@@ -647,7 +640,6 @@ void dev_activate(struct net_device *dev)
 void dev_deactivate(struct net_device *dev)
 {
 	struct Qdisc *qdisc;
-	struct sk_buff *skb;
 
 	spin_lock_bh(dev->queue_lock);
 	qdisc = dev->qdisc;
@@ -655,15 +647,11 @@ void dev_deactivate(struct net_device *dev)
 
 	qdisc_reset(qdisc);
 
-	skb = dev->gso_skb;
-	dev->gso_skb = NULL;
 	if (!skb_queue_empty(dev->blist))
 		skb_queue_purge(dev->blist);
 	dev->xmit_win = 1;
 	spin_unlock_bh(dev->queue_lock);
 
-	kfree_skb(skb);
-
 	dev_watchdog_down(dev);
 
 	/* Wait for outstanding dev_queue_xmit calls. */

[ofa-general] [NET_BATCH] Some perf results

2007-10-08 Thread jamal

I've attached a small PDF with results. This adds on top of the results I
posted yesterday (although I didn't see them reflected on netdev).

1) batch-ntlst is the patches posted today, which remove the temporary
list in qdisc restart, and is derived from this AM's net-2.6.24.
2) batch-kern is the result of the batching patches posted yesterday, which
had the temporary list, and is based on net-2.6.24 from yesterday AM.
3) net-2.6.24 is yesterday AM's net-2.6.24 with no changes.

So #1 is not a completely fair comparison with #2 and #3. However,
looking at the logs, the changes that have gone in are unrelated to the
areas I have touched, so I don't expect any effect.

Overall, removing the temporary list from qdisc_restart provides a small
improvement, noticeable only at the smaller packet sizes.

In any case, enjoy.

cheers,
jamal


batch-res2.pdf
Description: Adobe PDF document

[ofa-general] RE: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread jamal
On Mon, 2007-08-10 at 12:46 -0700, Waskiewicz Jr, Peter P wrote:

   I still have concerns how this will work with Tx multiqueue.
 The way the batching code looks right now, you will probably send a
 batch of skb's from multiple bands from PRIO or RR to the driver.  For
 non-Tx multiqueue drivers, this is fine.  For Tx multiqueue drivers,
 this isn't fine, since the Tx ring is selected by the value of
 skb-queue_mapping (set by the qdisc on {prio|rr}_classify()).  If the
 whole batch comes in with different queue_mappings, this could prove to
 be an interesting issue.

True, that needs some resolution. Here's a hand-waving thought:
assuming all packets of a specific map end up in the same qdisc queue,
it seems feasible to ask the qdisc scheduler to give us enough packages
(I've seen people use that term to refer to packets) for each hardware
ring's available space. With the patches I posted, I do that via
dev->xmit_win, which assumes only one view of the driver; essentially a
single ring.
If that is doable, then it is up to the driver to say
"I have space for 5 in ring[0], 10 in ring[1], 0 in ring[2]" based on
what scheduling scheme the driver implements - dev->blist can stay
the same. It's a handwave, so there may be issues there and there could
be better ways to handle this.
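
For illustration only, the handwave could be expressed as a per-ring window
instead of the single dev->xmit_win (struct subqueue_win and the helpers
below are made-up names; the concrete suggestion later in the thread is to
put xmit_win into struct net_device_subqueue):

---
struct subqueue_win {
	int xmit_win;			/* free slots in this tx ring */
};

/* driver side, on tx completion of ring "map":
 * "I have space for 5 in ring[0], 10 in ring[1], 0 in ring[2]" */
static void ring_report_space(struct subqueue_win *wins, u16 map, int space)
{
	wins[map].xmit_win = space;
}

/* qdisc side: only hand out skbs whose ring still has window left */
static bool ring_can_take(const struct subqueue_win *wins,
			  const struct sk_buff *skb)
{
	return wins[skb->queue_mapping].xmit_win > 0;
}
---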

Note: The other issue that needs resolving, which I raised earlier, is in
regard to multiqueue running on multiple CPUs servicing different rings
concurrently.

   Now I see in the driver HOWTO you recently sent that the driver
 will be expected to loop over the list and call it's -hard_start_xmit()
 for each skb.  

It's the core that does that, not the driver; the driver continues to
use ->hard_start_xmit() (albeit a modified one). The idea is not to have
many new interfaces.

 I think that should be fine for multiqueue, I just wanted
 to see if you had any thoughts on how it should work, any performance
 issues you can see (I can't think of any).  Since the batching feature
 and Tx multiqueue are very new features, I'd like to make sure we can
 think of any possible issues with them coexisting before they are both
 mainline.

Isn't multiqueue mainline already?

   Looking ahead for multiqueue, I'm still working on the per-queue
 lock implementation for multiqueue, which I know will not work with
 batching as it's designed today.  

The point behind batching is to reduce the cost of the locks by
amortizing across them. Even better, if one can, one should get rid
of the locks. Remind me, why do you need the per-queue-map lock? And is it
needed on the enqueuing side too? Maybe let's start there to help me
understand things.

 I'm still not sure how to handle this,
 because it really would require the batch you send to have the same
 queue_mapping in each skb, so you're grabbing the correct queue_lock.

Sure, that is doable if the driver can set a per-queue_mapping xmit_win
and the qdisc can be taught to say "give me packets for queue_mapping X".

 Or, we could have the core grab all the queue locks for each
 skb->queue_mapping represented in the batch.  That would block another
 batch though if it had any of those queues in it's next batch before the
 first one completed.  Thoughts?

I am not understanding the desire to have locks on a per-queuemap. I
think the single queuelock we have today should suffice. If the intent
is to have concurrent cpus running to each hardware ring, then this is
what i questioned earlier whether it was the right thing to do (very top
of email where i mention it as the other issue). 

cheers,
jamal


___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: parallel networking

2007-10-08 Thread jamal
On Mon, 2007-08-10 at 14:11 -0700, David Miller wrote:

 The problem is that the packet schedulers want global guarantees
 on packet ordering, not flow centric ones.
 
 That is the issue Jamal is concerned about.

indeed, thank you for giving it better wording. 

 The more I think about it, the more inevitable it seems that we really
 might need multiple qdiscs, one for each TX queue, to pull this full
 parallelization off.
 
 But the semantics of that don't smell so nice either.  If the user
 attaches a new qdisc to ethN, does it go to all the TX queues, or
 what?
 
 All of the traffic shaping technology deals with the device as a unary
 object.  It doesn't fit to multi-queue at all.

If you let only one CPU at a time access the xmit path you solve all
the reordering. If you want to be more fine-grained you make the
serialization point as low as possible in the stack - perhaps in the
driver.
But I think even what we have today with only one cpu entering the
dequeue/scheduler region, _for starters_, is not bad actually ;->  What
i am finding (and i can tell you i have been trying hard ;->) is that a
sufficiently fast cpu doesn't sit in the dequeue area for too long (and
batching reduces the time spent further). Very quickly there are no more
packets for it to dequeue from the qdisc, or the driver is stopped and it
has to get out of there. If you don't have any interrupt tied to a
specific cpu then you can have many cpus enter and leave that region all
the time. 

cheers,
jamal

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread jamal
On Mon, 2007-08-10 at 14:26 -0700, David Miller wrote:

 Add xmit_win to struct net_device_subqueue, problem solved.

If net_device_subqueue is visible from both the driver and core scheduler
area (i couldn't tell from looking at what's in there already), then that'll
do it.

cheers,
jamal

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] RE: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread jamal
On Mon, 2007-08-10 at 15:33 -0700, Waskiewicz Jr, Peter P wrote:


 Addressing your note/issue with different rings being services
 concurrently: I'd like to remove the QDISC_RUNNING bit from the global

The challenge to deal with is that netdevices, filters, the queues and
the scheduler are closely intertwined. So it is not just the scheduling
region and QDISC_RUNNING. For example, let's pick just the filters
because they are simple to see: You need to attach them to something -
whatever that is, you then need to synchronize against config and
multiple cpus trying to use them. You could: 
a) replicate them across cpus and only lock on config, but you are
wasting RAM then
b) attach them to rings instead of netdevices - but that makes me wonder
if those subqueues are now going to become netdevices. This also means
you change all user space interfaces to know about subqueues. If you
recall, this was a major contention point in our earlier discussion.

 device; with Tx multiqueue, this bit should be set on each queue (if at
 all), allowing multiple Tx rings to be loaded simultaneously. 

This is the issue i raised - refer to Dave's wording of it. If you run
access to the rings simultaneously you may not be able to guarantee any
ordering or proper qos in contention for wire-resources (think strict
prio in hardware) - as long as you have the qdisc area.  You may
actually get away with it with something like DRR.
You could totally bypass the qdisc region and go to the driver directly
and let it worry about the scheduling, but you'd have to make the qdisc
area a passthrough while providing the illusion to user space that all
is as before.

  The
 biggest issue today with the multiqueue implementation is the global
 queue_lock.  I see it being a hot source of contention in my testing; my
 setup is a 8-core machine (dual quad-core procs) with a 10GbE NIC, using
 8 Tx and 8 Rx queues.  On transmit, when loading all 8 queues, the
 enqueue/dequeue are hitting that lock quite a bit for the whole device.

Yes, the queuelock is expensive; in your case if all 8 hardware threads
are contending for that one device, you will suffer. The txlock on the
other hand is not that expensive since the contention is for a max of 2
cpus (tx and rx softirq). 
I tried to use that fact in the batching to move things that i processed
under queue lock into the area for txlock. I'd be very interested in
some results on such a piece of hardware with the 10G nic to see if
these theories make any sense.

 I really think that the queue_lock should join the queue_state, so the
 device no longer manages the top-level state (since we're operating
 per-queue instead of per-device).

Refer to above.


 
 The multiqueue implementation today enforces the number of qdisc bands
 (RR or PRIO) to be equal to the number of Tx rings your hardware/driver
 is supporting.  Therefore, the queue_lock and queue_state in the kernel
 directly relate to the qdisc band management.  If the queue stops from
 the driver, then the qdisc won't try to dequeue from the band. 

Good start.

  What I'm
 working on is to move the lock there too, so I can lock the queue when I
 enqueue (protect the band from multiple sources modifying the skb
 chain), and lock it when I dequeue.  This is purely for concurrency of
 adding/popping skb's from the qdisc queues.  

Ok, so the concurrency aspect is what worries me. What i am saying is
that sooner or later you have to serialize (which is anti-concurrency).
For example, consider CPU0 running a high prio queue and CPU1 running
the low prio queue of the same netdevice.
Assume CPU0 is getting a lot of interrupts or other work while CPU1
doesn't (so as to create a condition where CPU0 is the slower of the two). Then as long
as there are packets and there is space on the driver's rings, CPU1 will send
more packets per unit time than CPU0.
This contradicts the strict prio scheduler which says higher priority
packets ALWAYS go out first regardless of the presence of low prio
packets.  I am not sure i made sense.

cheers,
jamal

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: parallel networking

2007-10-08 Thread jamal
On Mon, 2007-08-10 at 15:33 -0700, David Miller wrote:

 Multiply whatever effect you think you might be able to measure due to
 that on your 2 or 4 way system, and multiple it up to 64 cpus or so
 for machines I am using.  This is where machines are going, and is
 going to become the norm.

Yes, i keep forgetting that ;- I need to train my brain to remember
that.

cheers,
jamal



___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread jamal
On Mon, 2007-08-10 at 18:41 -0700, David Miller wrote:

 I also want to point out another issue.  Any argument wrt. reordering
 is specious at best because right now reordering from qdisc to device
 happens anyways.

 And that's because we drop the qdisc lock first, then we grab the
 transmit lock on the device and submit the packet.  So, after we
 drop the qdisc lock, another cpu can get the qdisc lock, get the
 next packet (perhaps a lower priority one) and then sneak in to
 get the device transmit lock before the first thread can, and
 thus the packets will be submitted out of order.
 

You forgot QDISC_RUNNING, Dave ;-> The above can't happen. 
Essentially, at any one point in time we are guaranteed that we can have
multiple cpus enqueueing but only one can be dequeueing (the one that
managed to grab QDISC_RUNNING), i.e. multiple producers to the qdisc queue
but only one consumer. Only the dequeuer has access to the txlock.
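
For reference, the guarantee being described is roughly the following
(paraphrased from the 2.6.23-era qdisc_run(); treat it as a sketch rather
than a verbatim quote):

static inline void qdisc_run(struct net_device *dev)
{
	/* only the cpu that wins this test_and_set_bit dequeues and takes
	 * the tx lock; everyone else just enqueues to the qdisc and leaves
	 */
	if (!netif_queue_stopped(dev) &&
	    !test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
		__qdisc_run(dev);	/* loops over qdisc_restart(), then clears the bit */
}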
 
 This, along with other things, makes me believe that ordering really
 doesn't matter in practice.  And therefore, in practice, we can treat
 everything from the qdisc to the real hardware as a FIFO even if
 something else is going on inside the black box which might reorder
 packets on the wire.

I think it is important to get the scheduling right - estimations can be
a last resort. For example, If i have voip competing for the wire with
ftp on two different rings/cpus and i specified that voip should be more
important i may consider equipment faulty if it works most of the
time (when ftp is not clogging the wire) and at times i am asked to
repeat what i just said. 

cheers,
jamal


___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread jamal
On Tue, 2007-09-10 at 10:04 +0800, Herbert Xu wrote:

 Please revert
 
 commit 41843197b17bdfb1f97af0a87c06d24c1620ba90
 Author: Jamal Hadi Salim [EMAIL PROTECTED]
 Date:   Tue Sep 25 19:27:13 2007 -0700
 
 [NET_SCHED]: explict hold dev tx lock
 
 As this change introduces potential reordering and I don't think
 we've discussed this aspect sufficiently.

How does it introduce reordering?

cheers,
jamal

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread jamal
On Tue, 2007-09-10 at 10:16 +0800, Herbert Xu wrote:

 
 No it doesn't.  I'd forgotten about the QDISC_RUNNING bit :)

You should know better, you wrote it, and i've been going insane trying to
break it for at least a year now ;->

cheers,
jamal

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] [PATCHES] TX batching

2007-10-07 Thread jamal

Please provide feedback on the code and/or architecture.
Last time i posted them i received little. They are now updated to 
work with the latest net-2.6.24 from a few hours ago.

Patch 1: Introduces batching interface
Patch 2: Core uses batching interface
Patch 3: get rid of dev->gso_skb

What has changed since i posted last:
1) Fix a bug eyeballed by Patrick McHardy on requeue reordering.
2) Killed ->hard_batch_xmit().
3) I am going one step back and making this set of patches even simpler
so i can make it easier to review. I am therefore killing dev->hard_prep_xmit()
and focusing just on batching. I plan to re-introduce dev->hard_prep_xmit(),
but from now on i will make that a separate effort (it seems to be creating
confusion in relation to the general work).

Dave, please let me know if this meets your desires to allow devices
which are SG and able to compute CSUM to benefit, just in case i misunderstood. 
Herbert, if you can look at at least patch 3 i will appreciate it
(since it kills the dev->gso_skb that you introduced).

UPCOMING PATCHES
---
As before:
More patches to follow later if i get some feedback - i didnt want to 
overload people by dumping too many patches. Most of these patches 
mentioned below are ready to go; some need some re-testing and others 
need a little porting from an earlier kernel: 
- tg3 driver 
- tun driver
- pktgen
- netiron driver
- e1000 driver (LLTX)
- e1000e driver (non-LLTX)
- ethtool interface
- There is at least one other driver promised to me

Theres also a driver-howto i wrote that was posted on netdev last week
as well as one that describes the architectural decisions made.

PERFORMANCE TESTING

I started testing yesterday, but these tests take a long time
so i will post results probably at the end of the day sometime and
may stop running more tests and just compare batch vs non-batch results.
I have optimized the kernel-config so i expect my overall performance
numbers to look better than the last test results i posted for both
batch and non-batch.
My system under test hardware is still a 2x dual-core opteron with a 
couple of tg3s. 
A test tool generates udp traffic of different sizes for up to 60 
seconds per run or a total of 30M packets. I have 4 threads each 
running on a specific CPU which keep all the CPUs as busy as they can 
sending packets targeted at a directly connected box's udp discard port.
All 4 CPUs target a single tg3 to send. The receiving box has a tc rule 
which counts and drops all incoming udp packets to the discard port - this
allows me to make sure that the receiver is not the bottleneck in the
testing. Packet sizes sent are {8B, 32B, 64B, 128B, 256B, 512B, 1024B}. 
Each packet size run is repeated 10 times to ensure that there are no
transients. The average of all 10 runs is then computed and collected.

I do plan also to run forwarding and TCP tests in the future when the
dust settles.

cheers,
jamal

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] [PATCH 1/3] [NET_BATCH] Introduce batching interface

2007-10-07 Thread jamal
This patch introduces the netdevice interface for batching.

cheers,
jamal


[NET_BATCH] Introduce batching interface

This patch introduces the netdevice interface for batching.

BACKGROUND
-

A driver's dev->hard_start_xmit() has 4 typical parts:
a) packet formatting (example: vlan, mss, descriptor counting, etc.)
b) chip specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit, tell the DMA engine to chew on it,
tx completion interrupts, set last tx time, etc.

[For code cleanliness/readability sake, regardless of this work,
one should break dev->hard_start_xmit() into those 4 functions
anyways].

INTRODUCING API
---

With the api introduced in this patch, a driver which has all
4 parts and needs to support batching is advised to split its
dev->hard_start_xmit() in the following manner:
1) Remove #d from dev->hard_start_xmit() and put it in the
dev->hard_end_xmit() method.
2) #b and #c can stay in ->hard_start_xmit() (or whichever way you want
to do this)
3) #a is deferred to future work to reduce confusion (since it holds
on its own).

Note: There are drivers which may not need to support either of the two
approaches (example: the tun driver i patched), so the methods are
optional.

The xmit_win variable is set by the driver to tell the core how much space
it has to take on new skbs. It is introduced to ensure that when we pass
the driver a list of packets it will swallow all of them - which is
useful because we don't requeue to the qdisc (and avoids burning
unnecessary cpu cycles or introducing any strange re-ordering). The driver
tells us when it invokes netif_wake_queue how much space it has for
descriptors by setting this variable.

Refer to the driver howto for more details.
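
As a hedged illustration of that contract (generic and made up, not tied to
any particular driver), the tx-completion path of a driver would do something
like the following:

/* Illustrative only: on tx completion the driver re-advertises how many
 * descriptors it can take before waking the queue, so the core never
 * hands it more than it can swallow (frags included) and never requeues.
 */
static void example_tx_complete(struct net_device *dev, int free_descs)
{
	if (netif_queue_stopped(dev) && free_descs > MAX_SKB_FRAGS + 1) {
		dev->xmit_win = free_descs - (MAX_SKB_FRAGS + 1);
		netif_wake_queue(dev);
	}
}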

THEORY OF OPERATION
---

1. Core dequeues from qdiscs up to dev->xmit_win packets. Fragmented
and GSO packets are accounted for as well.
2. Core grabs TX_LOCK
3. Core loops over all skbs:
invokes driver dev->hard_start_xmit()
4. Core invokes driver dev->hard_end_xmit()
(a condensed sketch of this flow follows below)
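
In code terms the flow above is roughly the following condensed sketch (not
the literal patch; see the qdisc_restart() changes in patch 2 for the real
thing):

/* Condensed sketch of the batching flow, for illustration only. */
static int batching_flow_sketch(struct net_device *dev, struct sk_buff_head *batch)
{
	struct sk_buff *skb;

	/* 1. dequeue up to dev->xmit_win packets (frags/GSO accounted for),
	 *    still under dev->queue_lock
	 */
	if (!xmit_get_pkts(dev, dev->qdisc, batch))
		return 0;
	spin_unlock(&dev->queue_lock);

	/* 2 + 3. grab the tx lock once and push the whole batch down */
	HARD_TX_LOCK(dev, smp_processor_id());
	while ((skb = __skb_dequeue(batch)) != NULL)
		dev->hard_start_xmit(skb, dev);

	/* 4. one completion kick (IO, doorbell, etc.) for the whole batch */
	if (dev->hard_end_xmit)
		dev->hard_end_xmit(dev);
	HARD_TX_UNLOCK(dev);

	spin_lock(&dev->queue_lock);
	return qdisc_qlen(dev->qdisc);
}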

ACKNOWLEDGEMENT AND SOME HISTORY


There's a lot of history and reasoning on why batching in a document
i am writing which i may submit as a patch.
Thomas Graf (who doesn't know this probably) gave me the impetus to
start looking at this back in 2004 when he invited me to the linux
conference he was organizing. Parts of what i presented in SUCON in
2004 talk about batching. Herbert Xu forced me to take a second look around
2.6.18 - refer to my netconf 2006 presentation. Krishna Kumar provided
me with more motivation in May 2007 when he posted on netdev and engaged
me.
Sridhar Samudrala, Krishna Kumar, Matt Carlson, Michael Chan,
Jeremy Ethridge, Evgeniy Polyakov, Sivakumar Subramani, David Miller,
and Patrick McHardy, Jeff Garzik and Bill Fink have contributed in one or 
more of {bug fixes, enhancements, testing, lively discussion}. The 
Broadcom and neterion folks have been outstanding in their help.

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit 0a0762e2c615a980af284e86d9729d233e1bf7f4
tree c27fec824a9e75ffbb791647bdb595c082a54990
parent 190674ff1fe0b7bddf038c2bfddf45b9c6418e2a
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 07 Oct 2007 08:51:10 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 07 Oct 2007 08:51:10 -0400

 include/linux/netdevice.h |   11 ++
 net/core/dev.c|   83 +
 2 files changed, 94 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 91cd3f3..b31df5c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -467,6 +467,7 @@ struct net_device
 #define NETIF_F_NETNS_LOCAL8192/* Does not change network namespaces */
 #define NETIF_F_MULTI_QUEUE16384   /* Has multiple TX/RX queues */
 #define NETIF_F_LRO32768   /* large receive offload */
+#define NETIF_F_BTX65536   /* Capable of batch tx */
 
/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT  16
@@ -595,6 +596,9 @@ struct net_device
void*priv;  /* pointer to private data  */
int (*hard_start_xmit) (struct sk_buff *skb,
struct net_device *dev);
+   void(*hard_end_xmit) (struct net_device *dev);
+   int xmit_win;
+
/* These may be needed for future network-power-down code. */
unsigned long   trans_start;/* Time (in jiffies) of last Tx 
*/
 
@@ -609,6 +613,7 @@ struct net_device
 
/* delayed register/unregister */
struct list_headtodo_list;
+   struct sk_buff_head blist;
/* device index hash chain */
struct hlist_node   index_hlist;
 
@@ -1044,6 +1049,12 @@ extern int   dev_set_mac_address(struct 
net_device

[ofa-general] [PATCH 2/3][NET_BATCH] net core use batching

2007-10-07 Thread jamal
This patch adds the usage of batching within the core.

cheers,
jamal



[NET_BATCH] net core use batching

This patch adds the usage of batching within the core.
Performance results demonstrating improvement are provided separately.

I have #if-0ed some of the old functions so the patch is more readable.
A future patch will remove all if-0ed content.
Patrick McHardy eyeballed a bug that will cause re-ordering in case
of a requeue.

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit cd602aa5f84fcef6359852cd99c95863eeb91015
tree f31d2dde4f138ff6789682163624bc0f8541aa77
parent 0a0762e2c615a980af284e86d9729d233e1bf7f4
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 07 Oct 2007 09:13:04 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 07 Oct 2007 09:13:04 -0400

 net/sched/sch_generic.c |  132 +++
 1 files changed, 120 insertions(+), 12 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 95ae119..80ac56b 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -56,6 +56,7 @@ static inline int qdisc_qlen(struct Qdisc *q)
 	return q->q.qlen;
 }
 
+#if 0
 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
  struct Qdisc *q)
 {
@@ -110,6 +111,97 @@ static inline int handle_dev_cpu_collision(struct sk_buff 
*skb,
 
return ret;
 }
+#endif
+
+static inline int handle_dev_cpu_collision(struct net_device *dev)
+{
+	if (unlikely(dev->xmit_lock_owner == smp_processor_id())) {
+		if (net_ratelimit())
+			printk(KERN_WARNING
+			       "Dead loop on netdevice %s, fix it urgently!\n",
+			       dev->name);
+		return 1;
+	}
+	__get_cpu_var(netdev_rx_stat).cpu_collision++;
+	return 0;
+}
+
+static inline int
+dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev,
+  struct Qdisc *q)
+{
+
+   struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue_tail(skbs)) != NULL)
+		q->ops->requeue(skb, q);
+
+   netif_schedule(dev);
+   return 0;
+}
+
+static inline int
+xmit_islocked(struct sk_buff_head *skbs, struct net_device *dev,
+   struct Qdisc *q)
+{
+   int ret = handle_dev_cpu_collision(dev);
+
+   if (ret) {
+   if (!skb_queue_empty(skbs))
+   skb_queue_purge(skbs);
+   return qdisc_qlen(q);
+   }
+
+   return dev_requeue_skbs(skbs, dev, q);
+}
+
+static int xmit_count_skbs(struct sk_buff *skb)
+{
+	int count = 0;
+	for (; skb; skb = skb->next) {
+		count += skb_shinfo(skb)->nr_frags;
+		count += 1;
+	}
+	return count;
+}
+
+static int xmit_get_pkts(struct net_device *dev,
+			 struct Qdisc *q,
+			 struct sk_buff_head *pktlist)
+{
+	struct sk_buff *skb;
+	int count = dev->xmit_win;
+
+	if (count && dev->gso_skb) {
+		skb = dev->gso_skb;
+		dev->gso_skb = NULL;
+		count -= xmit_count_skbs(skb);
+		__skb_queue_tail(pktlist, skb);
+	}
+
+	while (count > 0) {
+		skb = q->dequeue(q);
+		if (!skb)
+			break;
+
+		count -= xmit_count_skbs(skb);
+		__skb_queue_tail(pktlist, skb);
+	}
+
+	return skb_queue_len(pktlist);
+}
+
+static int xmit_prepare_pkts(struct net_device *dev,
+			     struct sk_buff_head *tlist)
+{
+	struct sk_buff *skb;
+	struct sk_buff_head *flist = &dev->blist;
+
+	while ((skb = __skb_dequeue(tlist)) != NULL)
+		xmit_prepare_skb(skb, dev);
+
+	return skb_queue_len(flist);
+}
 
 /*
  * NOTE: Called under dev-queue_lock with locally disabled BH.
@@ -130,22 +222,32 @@ static inline int handle_dev_cpu_collision(struct sk_buff 
*skb,
  * 0 - queue is not empty.
  *
  */
-static inline int qdisc_restart(struct net_device *dev)
+
+static inline int qdisc_restart(struct net_device *dev,
+   struct sk_buff_head *tpktlist)
 {
 	struct Qdisc *q = dev->qdisc;
-   struct sk_buff *skb;
-   int ret;
+   int ret = 0;
 
-   /* Dequeue packet */
-   if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
-   return 0;
+   /* use of tpktlist reduces the amount of time we sit
+* holding the queue_lock
+   */
+   ret = xmit_get_pkts(dev, q, tpktlist);
 
+   if (!ret)
+   return 0;
 
-   /* And release queue */
+   /* We got em packets */
 	spin_unlock(&dev->queue_lock);
 
+	/* prepare to embark, no locks held; moves packets
+	 * to dev->blist
+	 */
+   xmit_prepare_pkts(dev, tpktlist);
+
+   /* bye packets */
HARD_TX_LOCK(dev, smp_processor_id());
-   ret

[ofa-general] [PATCH 3/3][NET_BATCH] kill dev-gso_skb

2007-10-07 Thread jamal
This patch removes dev->gso_skb as it is no longer necessary with the
batching code.

cheers,
jamal
[NET_BATCH] kill dev->gso_skb
The batching code does what gso used to batch at the drivers.
There is no more need for gso_skb. If for whatever reason the
requeueing is a bad idea we are going to leave packets in dev->blist
(and still not need dev->gso_skb)

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit 7ebf50f0f43edd4897b88601b4133612fc36af61
tree 5d942ecebc14de6254ab3c812d542d524e148e92
parent cd602aa5f84fcef6359852cd99c95863eeb91015
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 07 Oct 2007 09:30:19 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 07 Oct 2007 09:30:19 -0400

 include/linux/netdevice.h |3 ---
 net/sched/sch_generic.c   |   12 
 2 files changed, 0 insertions(+), 15 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b31df5c..4ddc6eb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -577,9 +577,6 @@ struct net_device
struct list_headqdisc_list;
unsigned long   tx_queue_len;   /* Max frames per queue allowed 
*/
 
-   /* Partially transmitted GSO packet. */
-   struct sk_buff  *gso_skb;
-
/* ingress path synchronizer */
spinlock_t  ingress_lock;
struct Qdisc*qdisc_ingress;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 80ac56b..772e7fe 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -172,13 +172,6 @@ static int xmit_get_pkts(struct net_device *dev,
 	struct sk_buff *skb;
 	int count = dev->xmit_win;
 
-	if (count && dev->gso_skb) {
-		skb = dev->gso_skb;
-		dev->gso_skb = NULL;
-		count -= xmit_count_skbs(skb);
-		__skb_queue_tail(pktlist, skb);
-	}
-
 	while (count > 0) {
 		skb = q->dequeue(q);
 		if (!skb)
@@ -659,7 +652,6 @@ void dev_activate(struct net_device *dev)
 void dev_deactivate(struct net_device *dev)
 {
 	struct Qdisc *qdisc;
-	struct sk_buff *skb;
 
 	spin_lock_bh(&dev->queue_lock);
 	qdisc = dev->qdisc;
@@ -667,15 +659,11 @@ void dev_deactivate(struct net_device *dev)
 
 	qdisc_reset(qdisc);
 
-	skb = dev->gso_skb;
-	dev->gso_skb = NULL;
 	if (!skb_queue_empty(&dev->blist))
 		skb_queue_purge(&dev->blist);
 	dev->xmit_win = 1;
 	spin_unlock_bh(&dev->queue_lock);
 
-   kfree_skb(skb);
-
dev_watchdog_down(dev);
 
/* Wait for outstanding dev_queue_xmit calls. */
___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

[ofa-general] NET_BATCH: some results

2007-10-07 Thread jamal

It seems prettier to just draw graphs, and since this one is a small file,
here it is attached. The graph compares a patched net-2.6.24 vs a plain
net-2.6.24 kernel with a udp app that sends on 4 CPUs as fast as the
lower layers will allow it. 
Refer to my earlier description of the test setup etc.
As i noted earlier on, for this hardware at about 200B or so we
approach wire speed, so the app is mostly idle above that as the link
becomes the bottleneck; for example it is > 85% idle on 512B and > 90% idle
on 1024B. This is so for either batch or non-batch. So the
differentiation is really in the smaller sized packets.

Enjoy!

cheers,
jamal


batch-pps.pdf
Description: Adobe PDF document
___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-03 Thread jamal
On Wed, 2007-03-10 at 01:29 -0400, Bill Fink wrote:

 It does sound sensible.  My own decidedly non-expert speculation
 was that the big 30 % performance hit right at 4 KB may be related
 to memory allocation issues or having to split the skb across
 multiple 4 KB pages.  

Plausible. But i also worry it could be 10 other things; for example, could
it be the driver used? I noted in my udp test an oddity that turned out
to be tx coal parameter related.
In any case, I will attempt to run those tests later.

 And perhaps it only affected the single
 process case because with multiple processes lock contention may
 be a bigger issue and the xmit batching changes would presumably
 help with that.  I am admittedly a novice when it comes to the
 detailed internals of TCP/skb processing, although I have been
 slowly slogging my way through parts of the TCP kernel code to
 try and get a better understanding, so I don't know if these
 thoughts have any merit.

You do bring up issues that need to be looked into and i will run those
tests.
Note, the effectiveness of batching becomes evident as the number of
flows grows. Actually, scratch that: it becomes evident if you can keep
the tx path busied out, which multiple running users contribute to. If i
can have a user per CPU with lots of traffic to send, i can create that
condition. It's a little boring in the scenario where the bottleneck is
the wire, but it needs to be checked.

 BTW does anyone know of a good book they would recommend that has
 substantial coverage of the Linux kernel TCP code, that's fairly
 up-to-date and gives both an overall view of the code and packet
 flow as well as details on individual functions and algorithms,
 and hopefully covers basic issues like locking and synchronization,
 concurrency of different parts of the stack, and memory allocation.
 I have several books already on Linux kernel and networking internals,
 but they seem to only cover the IP (and perhaps UDP) portions of the
 network stack, and none have more than a cursory reference to TCP.  
 The most useful documentation on the Linux TCP stack that I have
 found thus far is some of Dave Miller's excellent web pages and
 a few other web references, but overall it seems fairly skimpy
 for such an important part of the Linux network code.

Reading books or magazines may end up busying you out with some small
gains of knowledge at the end. They tend to be outdated fast. My advice
is if you start with a focus on one thing, watch the patches that fly
around on that area and learn that way. Read the code to further
understand things then ask questions when its not clear. Other folks may
have different views. The other way to do it is pick yourself some task
to either add or improve something and get your hands dirty that way. 

 It would be good to see some empirical evidence that there aren't
 any unforeseen gotchas for larger packet sizes, that at least the
 same level of performance can be obtained with no greater CPU
 utilization.

Reasonable - I will try with 9K after i move over to the new tree from
Dave and make sure nothing else broke in the previous tests.
And when all looks good, i will move to TCP.


  [1] On average i spend 10x more time performance testing and analysing
  results than writting code.
 
 As you have written previously, and I heartily agree with, this is a
 very good practice for developing performance enhancement patches.

To give you a perspective, the results i posted were each run 10
iterations per packet size per kernel. Each run is 60 seconds long. I
think i am past that stage for resolving or fixing anything for UDP or
pktgen, but i need to keep checking for any new regressions when Dave
updates his tree. Now multiply that by 5 packet sizes (I am going to add
2 more) and multiply that by 3-4 kernels. Then add the time it takes to
sift through the data, collect it and analyze it, and go back to the
drawing table when something doesn't look right.  Essentially, it needs a
weekend ;->

cheers,
jamal

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-02 Thread jamal
On Tue, 2007-02-10 at 00:25 -0400, Bill Fink wrote:

 One reason I ask, is that on an earlier set of alternative batching
 xmit patches by Krishna Kumar, his performance testing showed a 30 %
 performance hit for TCP for a single process and a size of 4 KB, and
 a performance hit of 5 % for a single process and a size of 16 KB
 (a size of 8 KB wasn't tested).  Unfortunately I was too busy at the
 time to inquire further about it, but it would be a major potential
 concern for me in my 10-GigE network testing with 9000-byte jumbo
 frames.  Of course the single process and 4 KB or larger size was
 the only case that showed a significant performance hit in Krishna
 Kumar's latest reported test results, so it might be acceptable to
 just have a switch to disable the batching feature for that specific
 usage scenario.  So it would be useful to know if your xmit batching
 changes would have similar issues.


There were many times while testing that i noticed inconsistencies, and
in each case when i analysed[1] i found it to be due to some variable
other than batching which needed some resolving, always via some
parametrization or other. I suspect what KK posted is in the same class.
To give you an example, with UDP, batching was giving worse results at
around 256B compared to 64B or 512B; investigating, i found that the
receiver just wasn't able to keep up and the udp layer dropped a lot of
packets, so both iperf and netperf reported bad numbers. Fixing the
receiver brought consistency back. On why 256B was the one
that overwhelmed the receiver more than 64B (which sent more pps)? From
some limited investigation, it seemed to me to be the effect of the
choice of the tg3 driver's default tx mitigation parameters as well as tx
ring size; which is something i plan to revisit (but neutralizing it
helps me focus on just batching). In the end i dropped both netperf and
iperf for similar reasons and wrote my own app. What i am trying to
achieve is to demonstrate whether batching is a GoodThing. In experimentation
like this, it is extremely valuable to reduce the variables. Batching
may expose other orthogonal issues - those need to be resolved or fixed
as they are found. I hope that sounds sensible.

Back to the >=9K packet size you raise above:
I don't have a 10GigE card so i am theorizing. Given that there's an
observed benefit to batching for a saturated link with smaller packets
(in my results "small" is anything below 256B, which maps to about
380Kpps; anything above that seems to approach wire speed and the link is
the bottleneck); then i theorize that 10GigE with 9K jumbo frames, if
already achieving wire rate, should continue to do so. And sizes below
that will see improvements if they were not already hitting wire rate.
So i would say that with 10G NICs, there will be more observed
improvements with batching with apps that do bulk transfers (assuming
those apps are not seeing wire speed already). Note that this hasn't been
quite the case even with TSO, given the bottlenecks in the Linux
receivers that J Heffner put nicely in a response to some results you
posted - but that exposes an issue with Linux receivers rather than TSO.

 Also for your xmit batching changes, I think it would be good to see
 performance comparisons for TCP and IP forwarding in addition to your
 UDP pktgen tests, 

That is not pktgen - it is a udp app running in process context
utilizing all 4 CPUs to send traffic. pktgen bypasses the stack entirely
and has its own merits in proving that batching is in fact a GoodThing,
even if it is just for traffic generation ;->

 including various packet sizes up to and including
 9000-byte jumbo frames.

I will do TCP and forwarding tests in the near future. 

cheers,
jamal

[1] On average i spend 10x more time performance testing and analysing
results than writting code.

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-01 Thread jamal
On Mon, 2007-01-10 at 12:42 +0200, Patrick McHardy wrote:
 jamal wrote:

  +   while ((skb = __skb_dequeue(skbs)) != NULL)
  +   q->ops->requeue(skb, q);
 
 
 ->requeue queues at the head, so this looks like it would reverse
 the order of the skbs.

Excellent catch!  thanks; i will fix.

As a side note: Any batching driver should _never_ have to requeue; if
it does it is buggy. And for the non-batching ones, if they ever requeue it will
be a single packet, so not much reordering.
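
A cheap way to catch violations of that invariant while debugging would be
something along these lines (illustrative only, not part of the posted
patches):

static inline int
dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev,
		 struct Qdisc *q)
{
	struct sk_buff *skb;

	/* batching (NETIF_F_BTX) drivers must swallow the whole batch,
	 * so a requeue here would indicate a driver bug
	 */
	WARN_ON_ONCE(dev->features & NETIF_F_BTX);

	while ((skb = __skb_dequeue_tail(skbs)) != NULL)
		q->ops->requeue(skb, q);

	netif_schedule(dev);
	return 0;
}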

Thanks again Patrick.

cheers,
jamal

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-01 Thread jamal
On Mon, 2007-01-10 at 00:11 -0400, Bill Fink wrote:

 Have you done performance comparisons for the case of using 9000-byte
 jumbo frames?

I haven't, but will try if any of the gige cards i have support it.

As a side note: I have not seen any useful gains or losses as the packet
size approaches even 1500B MTU. For example, past about 256B, neither the
batching nor the non-batching case gives much difference in either throughput
or cpu use. Below 256B, there's a noticeable gain for batching.
Note, in the case of my tests all 4 CPUs are in full-throttle UDP and
so the occupancy of both the qdisc queue(s) and ethernet ring is
constantly high. For example at 512B, the app is 80% idle on all 4 CPUs
and we are hitting in the range of wire speed. We are at 90% idle at
1024B. This is the case with or without batching.  So my suspicion is
that with that trend a 9000B packet will just follow the same pattern.


cheers,
jamal

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] [PATCHES] TX batching

2007-09-30 Thread jamal

Latest net-2.6.24 breaks the patches i posted last week; so this is an
update to resolve that. If you are receiving these emails and are
finding them overloading, please give me a shout and i will remove your
name.

Please provide feedback on the code and/or architecture.
Last time i posted them i received none. They are now updated to 
work with the latest net-2.6.24 from a few hours ago.

Patch 1: Introduces batching interface
Patch 2: Core uses batching interface
Patch 3: get rid of dev->gso_skb

I have decided i will kill ->hard_batch_xmit() and not support any
more LLTX drivers. This is the last set of patches that will have
->hard_batch_xmit(), as i am supporting an e1000 that is LLTX.

Dave, please let me know if this meets your desires to allow devices
which are SG and able to compute CSUM to benefit, just in case i
misunderstood. 
Herbert, if you can look at at least patch 3 i will appreciate it
(since it kills the dev->gso_skb that you introduced).

More patches to follow later if i get some feedback - i didnt want to 
overload people by dumping too many patches. Most of these patches 
mentioned below are ready to go; some need some re-testing and others 
need a little porting from an earlier kernel: 
- tg3 driver (tested and works well, but dont want to send 
- tun driver
- pktgen
- netiron driver
- e1000 driver (LLTX)
- e1000e driver (non-LLTX)
- ethtool interface
- There is at least one other driver promised to me

Theres also a driver-howto i wrote that was posted on netdev last week
as well as one that describes the architectural decisions made.

Each of these patches has been performance tested (last with DaveM's
tree from last weekend) and the results are in the logs on a per-patch 
basis.  My system under test hardware is a 2x dual-core opteron with a 
couple of tg3s. I have not re-run the tests with this morning's tree
but i suspect not much difference.
My test tool generates udp traffic of different sizes for up to 60 
seconds per run or a total of 30M packets. I have 4 threads each 
running on a specific CPU which keep all the CPUs as busy as they can 
sending packets targeted at a directly connected box's udp discard
port.
All 4 CPUs target a single tg3 to send. The receiving box has a tc rule 
which counts and drops all incoming udp packets to the discard port - this
allows me to make sure that the receiver is not the bottleneck in the
testing. Packet sizes sent are {64B, 128B, 256B, 512B, 1024B}. Each
packet size run is repeated 10 times to ensure that there are no
transients. The average of all 10 runs is then computed and collected.

cheers,
jamal

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] [PATCH 1/4] [NET_BATCH] Introduce batching interface

2007-09-30 Thread jamal

This patch introduces the netdevice interface for batching.

cheers,
jamal


[NET_BATCH] Introduce batching interface

This patch introduces the netdevice interface for batching.

A typical driver dev->hard_start_xmit() has 4 parts:
a) packet formatting (example: vlan, mss, descriptor counting, etc.)
b) chip specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit, tell the DMA engine to chew on it,
tx completion interrupts, etc.

[For code cleanliness/readability sake, regardless of this work,
one should break dev->hard_start_xmit() into those 4 functions
anyways].

With the api introduced in this patch, a driver which has all
4 parts and needs to support batching is advised to split its
dev->hard_start_xmit() in the following manner:
1) use its dev->hard_prep_xmit() method to achieve #a
2) use its dev->hard_end_xmit() method to achieve #d
3) #b and #c can stay in ->hard_start_xmit() (or whichever way you want
to do this)
Note: There are drivers which may not need to support either of the two
methods (example: the tun driver i patched), so the two methods are
optional.

The core will first do the packet formatting by invoking your supplied
dev->hard_prep_xmit() method. It will then pass you the packet via
your dev->hard_start_xmit() method and lastly will invoke your
dev->hard_end_xmit() when it completes passing you all the packets
queued for you. dev->hard_prep_xmit() is invoked without holding any
tx lock but the rest are under TX_LOCK().

LLTX presents a challenge in that we have to introduce a deviation
from the norm and introduce the ->hard_batch_xmit() method. An LLTX
driver presents us with ->hard_batch_xmit(), to which we pass a list
of packets in a dev->blist skb queue. It is then the responsibility
of the ->hard_batch_xmit() to exercise steps #b and #c for all packets
and #d when the batching is complete. Step #a is already done for you
by the time you get the packets in dev->blist.
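
A hypothetical LLTX driver's ->hard_batch_xmit() following that division of
labour might look like this (names prefixed with example_ are placeholders,
not real functions; the method itself was dropped in later revisions of these
patches):

static int example_hard_batch_xmit(struct net_device *dev)
{
	struct example_priv *priv = netdev_priv(dev);	/* hypothetical driver private struct */
	struct sk_buff *skb;

	spin_lock(&priv->tx_lock);	/* LLTX: the driver does its own tx locking */
	while ((skb = __skb_dequeue(&dev->blist)) != NULL)
		example_queue_on_ring(priv, skb);	/* steps #b + #c, placeholder */
	example_kick_dma(priv);		/* step #d: one doorbell for the whole batch */
	spin_unlock(&priv->tx_lock);

	return NETDEV_TX_OK;
}
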
And lastly, the xmit_win variable is introduced to ensure that when we pass
the driver a list of packets it will swallow all of them - which is
useful because we don't requeue to the qdisc (and avoids burning
unnecessary cpu cycles or introducing any strange re-ordering). The driver
tells us when it invokes netif_wake_queue how much space it has for
descriptors by setting this variable.

Some decisions i had to make:
- every driver will have an xmit_win variable and the core will set it
to 1, which means the behavior of non-batching drivers stays the same.
- the batch list, blist, is no longer a pointer; it wastes a little extra
memory, which i plan to recoup by killing gso_skb in later patches.

There's a lot of history and reasoning on why batching in a document
i am writing which i may submit as a patch.
Thomas Graf (who doesnt know this probably) gave me the impetus to
start looking at this back in 2004 when he invited me to the linux
conference he was organizing. Parts of what i presented in SUCON in
2004 talk about batching. Herbert Xu forced me to take a second look around
2.6.18 - refer to my netconf 2006 presentation. Krishna Kumar provided
me with more motivation in May 2007 when he posted on netdev and engaged
me.
Sridhar Samudrala, Krishna Kumar, Matt Carlson, Michael Chan,
Jeremy Ethridge, Evgeniy Polyakov, Sivakumar Subramani, and
David Miller, have contributed in one or more of {bug fixes, enhancements,
testing, lively discussion}. The Broadcom and netiron folks have been
outstanding in their help.

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit 624a0bfeb971c9aa58496c7372df01f0ed750def
tree c1c0ee53453392866a5241631a7502ce6569b2cc
parent 260dbcc4b0195897c539c5ff79d95afdddeb3378
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:23:31 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:23:31 -0400

 include/linux/netdevice.h |   17 +++
 net/core/dev.c|  106 +
 2 files changed, 123 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 91cd3f3..df1fb61 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -467,6 +467,7 @@ struct net_device
 #define NETIF_F_NETNS_LOCAL8192/* Does not change network namespaces */
 #define NETIF_F_MULTI_QUEUE16384   /* Has multiple TX/RX queues */
 #define NETIF_F_LRO32768   /* large receive offload */
+#define NETIF_F_BTX65536   /* Capable of batch tx */
 
/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT  16
@@ -595,6 +596,15 @@ struct net_device
void*priv;  /* pointer to private data  */
int (*hard_start_xmit) (struct sk_buff *skb,
struct net_device *dev);
+   /* hard_batch_xmit is needed for LLTX, kill it when those
+* disappear or better kill it now and dont support LLTX
+   */
+   int

[ofa-general] [PATCH 2/3][NET_BATCH] net core use batching

2007-09-30 Thread jamal

This patch adds the usage of batching within the core.

cheers,
jamal



[NET_BATCH] net core use batching

This patch adds the usage of batching within the core.
The same test methodology used in introducing txlock is used, with
the following results on different kernels:

+--------+--------+--------+--------+--------+--------+
|        |  64B   |  128B  |  256B  |  512B  | 1024B  |
+--------+--------+--------+--------+--------+--------+
|Original| 467482 | 463061 | 388267 | 216308 | 114704 |
|txlock  | 468922 | 464060 | 388298 | 216316 | 114709 |
|tg3nobtx| 468012 | 464079 | 388293 | 216314 | 114704 |
|tg3btxdr| 480794 | 475102 | 388298 | 216316 | 114705 |
|tg3btxco| 481059 | 475423 | 388285 | 216308 | 114706 |
+--------+--------+--------+--------+--------+--------+

The first two columns, Original and txlock, were introduced in an earlier
patch and demonstrate a slight increase in performance with txlock.
tg3nobtx shows the tg3 driver with no changes to support batching.
The purpose of this test is to demonstrate the effect of introducing
the core changes on a driver that doesn't support them.
Although this patch brings performance down slightly compared to txlock
for such netdevices, it is still better than the original
kernel.
tg3btxdr demonstrates the effect of using ->hard_batch_xmit() with the tg3
driver. tg3btxco demonstrates the effect of letting the core do all the
work. As can be seen the last two are not very different in performance.
The difference is that ->hard_batch_xmit() introduces a new method, which
is intrusive.

I have #if-0ed some of the old functions so the patch is more readable.

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit 9b4a8fb190278d388c0a622fb5529d184ac8c7dc
tree 053e8dda02b5d26fe7cc778823306a8a526df513
parent 624a0bfeb971c9aa58496c7372df01f0ed750def
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:38:11 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:38:11 -0400

 net/sched/sch_generic.c |  127 +++
 1 files changed, 115 insertions(+), 12 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 95ae119..86a3f9d 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -56,6 +56,7 @@ static inline int qdisc_qlen(struct Qdisc *q)
 	return q->q.qlen;
 }
 
+#if 0
 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
  struct Qdisc *q)
 {
@@ -110,6 +111,97 @@ static inline int handle_dev_cpu_collision(struct sk_buff 
*skb,
 
return ret;
 }
+#endif
+
+static inline int handle_dev_cpu_collision(struct net_device *dev)
+{
+	if (unlikely(dev->xmit_lock_owner == smp_processor_id())) {
+		if (net_ratelimit())
+			printk(KERN_WARNING
+			       "Dead loop on netdevice %s, fix it urgently!\n",
+			       dev->name);
+		return 1;
+	}
+	__get_cpu_var(netdev_rx_stat).cpu_collision++;
+	return 0;
+}
+
+static inline int
+dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev,
+  struct Qdisc *q)
+{
+
+   struct sk_buff *skb;
+
+   while ((skb = __skb_dequeue(skbs)) != NULL)
+		q->ops->requeue(skb, q);
+
+   netif_schedule(dev);
+   return 0;
+}
+
+static inline int
+xmit_islocked(struct sk_buff_head *skbs, struct net_device *dev,
+   struct Qdisc *q)
+{
+   int ret = handle_dev_cpu_collision(dev);
+
+   if (ret) {
+   if (!skb_queue_empty(skbs))
+   skb_queue_purge(skbs);
+   return qdisc_qlen(q);
+   }
+
+   return dev_requeue_skbs(skbs, dev, q);
+}
+
+static int xmit_count_skbs(struct sk_buff *skb)
+{
+	int count = 0;
+	for (; skb; skb = skb->next) {
+		count += skb_shinfo(skb)->nr_frags;
+		count += 1;
+	}
+	return count;
+}
+
+static int xmit_get_pkts(struct net_device *dev,
+			 struct Qdisc *q,
+			 struct sk_buff_head *pktlist)
+{
+	struct sk_buff *skb;
+	int count = dev->xmit_win;
+
+	if (count && dev->gso_skb) {
+		skb = dev->gso_skb;
+		dev->gso_skb = NULL;
+		count -= xmit_count_skbs(skb);
+		__skb_queue_tail(pktlist, skb);
+	}
+
+	while (count > 0) {
+		skb = q->dequeue(q);
+		if (!skb)
+			break

[ofa-general] [PATCH 3/3][NET_SCHED] kill dev-gso_skb

2007-09-30 Thread jamal

This patch removes dev->gso_skb as it is no longer necessary with the
batching code.

cheers,
jamal


[NET_SCHED] kill dev->gso_skb
The batching code does what gso used to batch at the drivers.
There is no more need for gso_skb. If for whatever reason the
requeueing is a bad idea we are going to leave packets in dev->blist
(and still not need dev->gso_skb)

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit c2916c550d228472ddcdd676c2689fa6c8ecfcc0
tree 5beaf197fd08a038d83501f405017f48712d0318
parent 9b4a8fb190278d388c0a622fb5529d184ac8c7dc
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:38:58 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:38:58 -0400

 include/linux/netdevice.h |3 ---
 net/sched/sch_generic.c   |   12 
 2 files changed, 0 insertions(+), 15 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index df1fb61..cea400a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -577,9 +577,6 @@ struct net_device
struct list_headqdisc_list;
unsigned long   tx_queue_len;   /* Max frames per queue allowed 
*/
 
-   /* Partially transmitted GSO packet. */
-   struct sk_buff  *gso_skb;
-
/* ingress path synchronizer */
spinlock_t  ingress_lock;
struct Qdisc*qdisc_ingress;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 86a3f9d..b4e1607 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -172,13 +172,6 @@ static int xmit_get_pkts(struct net_device *dev,
 	struct sk_buff *skb;
 	int count = dev->xmit_win;
 
-	if (count && dev->gso_skb) {
-		skb = dev->gso_skb;
-		dev->gso_skb = NULL;
-		count -= xmit_count_skbs(skb);
-		__skb_queue_tail(pktlist, skb);
-	}
-
 	while (count > 0) {
 		skb = q->dequeue(q);
 		if (!skb)
@@ -654,7 +647,6 @@ void dev_activate(struct net_device *dev)
 void dev_deactivate(struct net_device *dev)
 {
 	struct Qdisc *qdisc;
-	struct sk_buff *skb;
 
 	spin_lock_bh(&dev->queue_lock);
 	qdisc = dev->qdisc;
@@ -662,15 +654,11 @@ void dev_deactivate(struct net_device *dev)
 
 	qdisc_reset(qdisc);
 
-	skb = dev->gso_skb;
-	dev->gso_skb = NULL;
 	if (!skb_queue_empty(&dev->blist))
 		skb_queue_purge(&dev->blist);
 	dev->xmit_win = 1;
 	spin_unlock_bh(&dev->queue_lock);
 
-   kfree_skb(skb);
-
dev_watchdog_down(dev);
 
/* Wait for outstanding dev_queue_xmit calls. */
___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

[ofa-general] Re: [PATCH 1/3] [NET_BATCH] Introduce batching interface

2007-09-30 Thread jamal
Fixed subject - should be 1/3 not 1/4

On Sun, 2007-30-09 at 14:51 -0400, jamal wrote:
 This patch introduces the netdevice interface for batching.
 
 cheers,
 jamal
 
 

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] Re: [PATCHES] TX batching

2007-09-30 Thread jamal
And here's a patch that provides a sample of the usage for batching with
tg3. 
It requires the "[TG3] Some cleanups" patch i posted earlier. 

cheers,
jamal

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 5a864bd..9aafb78 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -3103,6 +3103,13 @@ static inline u32 tg3_tx_avail(struct tg3 *tp)
 		((tp->tx_prod - tp->tx_cons) & (TG3_TX_RING_SIZE - 1)));
 }
 
+static inline void tg3_set_win(struct tg3 *tp)
+{
+	tp->dev->xmit_win = tg3_tx_avail(tp) - (MAX_SKB_FRAGS + 1);
+	if (tp->dev->xmit_win < 1)
+		tp->dev->xmit_win = 1;
+}
+
 /* Tigon3 never reports partial packet sends.  So we do not
  * need special logic to handle SKBs that have not had all
  * of their frags sent yet, like SunGEM does.
@@ -3165,8 +3172,10 @@ static void tg3_tx(struct tg3 *tp)
 		     (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))) {
 		netif_tx_lock(tp->dev);
 		if (netif_queue_stopped(tp->dev) &&
-		    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))
+		    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp))) {
+			tg3_set_win(tp);
 			netif_wake_queue(tp->dev);
+		}
 		netif_tx_unlock(tp->dev);
 	}
 }
@@ -4007,8 +4016,13 @@ void tg3_kick_DMA(struct net_device *dev)
 
 	if (unlikely(tg3_tx_avail(tp) <= (MAX_SKB_FRAGS + 1))) {
 		netif_stop_queue(dev);
-		if (tg3_tx_avail(tp) >= TG3_TX_WAKEUP_THRESH(tp))
+		dev->xmit_win = 1;
+		if (tg3_tx_avail(tp) >= TG3_TX_WAKEUP_THRESH(tp)) {
+			tg3_set_win(tp);
 			netif_wake_queue(dev);
+		}
+	} else {
+		tg3_set_win(tp);
 	}
 
 	mmiowb();
@@ -4085,6 +4099,7 @@ static int tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
 		if (!netif_queue_stopped(dev)) {
 			netif_stop_queue(dev);
+			tp->dev->xmit_win = 1;
 
 			/* This is a hard error, log it. */
 			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
@@ -4100,6 +4115,25 @@ static int tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	return ret;
 }
 
+static int tg3_start_bxmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tg3 *tp = netdev_priv(dev);
+
+	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
+		if (!netif_queue_stopped(dev)) {
+			netif_stop_queue(dev);
+			dev->xmit_win = 1;
+
+			/* This is a hard error, log it. */
+			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
+			       "queue awake!\n", dev->name);
+		}
+		return NETDEV_TX_BUSY;
+	}
+
+	return tg3_enqueue(skb, dev);
+}
+
 static int tg3_start_xmit_dma_bug(struct sk_buff *, struct net_device *);
 
 /* Use GSO to workaround a rare TSO bug that may be triggered when the
@@ -4112,9 +4146,11 @@ static int tg3_tso_bug(struct tg3 *tp, struct sk_buff *skb)
 	/* Estimate the number of fragments in the worst case */
 	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->gso_segs * 3))) {
 		netif_stop_queue(tp->dev);
+		tp->dev->xmit_win = 1;
 		if (tg3_tx_avail(tp) <= (skb_shinfo(skb)->gso_segs * 3))
 			return NETDEV_TX_BUSY;
 
+		tg3_set_win(tp);
 		netif_wake_queue(tp->dev);
 	}
 
@@ -4267,6 +4303,25 @@ static int tg3_enqueue_buggy(struct sk_buff *skb, struct net_device *dev)
 	return NETDEV_TX_OK;
 }
 
+static int tg3_start_bxmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tg3 *tp = netdev_priv(dev);
+
+	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
+		if (!netif_queue_stopped(dev)) {
+			netif_stop_queue(dev);
+			dev->xmit_win = 1;
+
+			/* This is a hard error, log it. */
+			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
+			       "queue awake!\n", dev->name);
+		}
+		return NETDEV_TX_BUSY;
+	}
+
+	return tg3_enqueue_buggy(skb, dev);
+}
+
 static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
 {
 	struct tg3 *tp = netdev_priv(dev);
@@ -4283,6 +4338,7 @@ static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
 	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
 		if (!netif_queue_stopped(dev)) {
 			netif_stop_queue(dev);
+			dev->xmit_win = 1;
 
 			/* This is a hard error, log it. */
 			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
@@ -11099,15 +11155,19 @@ static int __devinit tg3_get_invariants(struct tg3 *tp)
 	else
 		tp->tg3_flags &= ~TG3_FLAG_POLL_SERDES;
 
+	tp->dev->hard_end_xmit = tg3_kick_DMA;
 	/* All chips before 5787 can get confused if TX buffers
 	 * straddle the 4GB address boundary in some cases.
 	 */
 	if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5755 ||
 	    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5787 ||
-	    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5906)
-		tp->dev->hard_start_xmit = tg3_start_xmit;
-	else
-		tp->dev->hard_start_xmit = tg3_start_xmit_dma_bug;
+	    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5906) {
+		tp->dev->hard_start_xmit = tg3_start_bxmit;
+		tp->dev->hard_prep_xmit = tg3_prep_frame;
+	} else {
+		tp->dev->hard_start_xmit = tg3_start_bxmit_dma_bug;
+		tp->dev->hard_prep_xmit = tg3_prep_bug_frame;
+	}
 
 	tp->rx_offset = 2;
 	if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5701 &&
@@ -11955,6 +12015,8

[ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock

2007-09-25 Thread jamal
On Mon, 2007-24-09 at 16:47 -0700, Waskiewicz Jr, Peter P wrote:

 We should make sure we're symmetric with the locking on enqueue to
 dequeue.  If we use the single device queue lock on enqueue, then
 dequeue will also need to check that lock in addition to the individual
 queue lock.  The details of this are more trivial than the actual
 dequeue to make it efficient though.

It would be interesting to observe the performance implications.

 The dequeue locking would be pushed into the qdisc itself.  This is how
 I had it originally, and it did make the code more complex, but it was
 successful at breaking the heavily-contended queue_lock apart.  I have a
 subqueue structure right now in netdev, which only has queue_state (for
 netif_{start|stop}_subqueue).  This state is checked in sch_prio right
 now in the dequeue for both prio and rr.  My approach is to add a
 queue_lock in that struct, so each queue allocated by the driver would
 have a lock per queue.  Then in dequeue, that lock would be taken when
 the skb is about to be dequeued.

More locks imply degraded performance. If only one processor can enter
that region, presumably after acquiring the outer lock, why this
secondary lock per queue?

 The skb-queue_mapping field also maps directly to the queue index
 itself, so it can be unlocked easily outside of the context of the
 dequeue function.  The policy would be to use a spin_trylock() in
 dequeue, so that dequeue can still do work if enqueue or another dequeue
 is busy.  

So there could be a parallel cpu dequeueing at the same time?
Wouldn't this have implications depending on what scheduling
algorithm is used? If for example I was doing priority queueing, I would
want to make sure the highest priority is being dequeued first AND by
all means goes out first to the driver; I don't want a parallel cpu
dequeueing a lower prio packet at the same time.

 And the allocation of qdisc queues to device queues is assumed
 to be one-to-one (that's how the qdisc behaves now).

Ok, that brings back the discussion we had; my thinking was something
like dev->hard_prep_xmit() would select the ring, and I think you
already map the ring statically to a qdisc queue. So I don't think
dev->hard_prep_xmit() is useful to you.
In any case, there is nothing the batching patches do that interferes with or
prevents you from going the path you intend to. Instead of dequeueing one
packet, you dequeue several, and instead of sending the driver one
packet, you send several. And using the xmit_win, you should never ever
have to requeue.

cheers,
jamal



[ofa-general] Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock

2007-09-25 Thread jamal
On Mon, 2007-24-09 at 17:14 -0700, Stephen Hemminger wrote:

 Since we are redoing this, 
 is there any way to make the whole TX path
 more lockless?  The existing model seems to be more of a monitor than
 a real locking model.

What do you mean it is more of a monitor?

On the challenge of making it lockless:
Just about every NAPI driver combines the tx pruning with rx polling. If you
are dealing with tx resources on the receive thread as well as the tx thread,
_you need_ locking. The only other way to avoid it is to separate
the rx path interrupts from the ones on tx related resources; the last NAPI
driver that did that was tulip; I think the e1000 for a short period in
its life did the same as well. But that has been frowned on and people
have evolved away from it.

cheers,
jamal



[ofa-general] Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock

2007-09-25 Thread jamal
On Tue, 2007-25-09 at 08:24 -0700, Stephen Hemminger wrote:

 The transmit code path is locked as a code region, rather than just object 
 locking
 on the transmit queue or other fine grained object. This leads to moderately 
 long
 lock hold times when multiple qdisc's and classification is being done.

It will be pretty tricky to optimize that path given the dependencies
between the queues, classifiers, and actions in enqueues; schedulers in
dequeues; as well as their config/queries from user space, which could
happen concurrently on all N CPUs.
The txlock optimization I added in patch 1 intends to let go of the queue
lock sooner when we enter the dequeue region, to reduce the contention.

A further optimization I made was to reduce the time the tx lock is held
at the driver by moving gunk that doesn't need lock-holding
into the new method dev->hard_end_xmit() (refer to patch #2).

 If we went to finer grain locking it would also mean changes to all network
 devices using the new locking model. My assumption is that we would use
 something like the features flag to do the transition for backward 
 compatibility.
 Take this as a purely what if or it would be nice if kind of suggestion
 not a requirement or some grand plan.

Ok, hopefully someone would demonstrate how to achieve it; seems a hard
thing to achieve.

cheers,
jamal



[ofa-general] Re: [DOC] Net batching driver howto

2007-09-25 Thread jamal
On Tue, 2007-25-09 at 13:16 -0700, Randy Dunlap wrote:
 On Mon, 24 Sep 2007 18:54:19 -0400 jamal wrote:
 
  I have updated the driver howto to match the patches i posted yesterday.
  attached. 
 
 Thanks for sending this.

Thank you for reading it Randy. 

 This is an early draft, right?

Its a third revision - but you could call it early. When it is done, i
will probably put a pointer to it in some patch.

 I'll fix some typos etc. in it (patch attached) and add some whitespace.
 Please see RD: in the patch for more questions/comments.

Thanks, will do and changes will show up in the next update.

 IMO it needs some changes to eliminate words like we, you,
 and your (words that personify code).  Those words are OK
 when talking about yourself.

The narrative intent is that of me (or whoever is doing the
description) sitting there with a pen and paper, and maybe a laptop,
walking through the details with someone who needs to understand those
details. If you think it is important to make it formal, then by all
means be my guest.

Again, thanks for taking the time.

cheers,
jamal



[ofa-general] Re: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock

2007-09-25 Thread jamal
On Tue, 2007-25-09 at 18:15 -0400, jamal wrote:

 A further optimization I made was to reduce the time the tx lock is held
 at the driver by moving gunk that doesn't need lock-holding
 into the new method dev->hard_end_xmit() (refer to patch #2).

Sorry, that should have read dev->hard_prep_xmit().

cheers,
jamal



[ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock

2007-09-24 Thread jamal
On Mon, 2007-24-09 at 12:12 -0700, Waskiewicz Jr, Peter P wrote:

 Hi Jamal,
   I've been (slowly) working on resurrecting the original design
 of my multiqueue patches to address this exact issue of the queue_lock
 being a hot item.  I added a queue_lock to each queue in the subqueue
 struct, and in the enqueue and dequeue, just lock that queue instead of
 the global device queue_lock.  The only two issues to overcome are the
 QDISC_RUNNING state flag, since that also serializes entry into the
 qdisc_restart() function, and the qdisc statistics maintenance, which
 needs to be serialized.  Do you think this work along with your patch
 will benefit from one another? 

The one thing that seems obvious is to use dev->hard_prep_xmit() in the
patches I posted to select the xmit ring. You should be able to
figure out the xmit ring without holding any lock.
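
To illustrate, here is a rough sketch (not from the posted patches) of how
a multiqueue driver could pick the tx ring in ->hard_prep_xmit() without
taking any lock. The my_select_ring() helper, the ring count and the use of
skb->queue_mapping are assumptions made up for this example:

---
#include <linux/netdevice.h>
#include <linux/skbuff.h>

#define MY_NUM_TX_RINGS 4	/* assumed ring count, for the sketch only */

/* assumed helper: any cheap, lock-free hash over the flow will do */
static u16 my_select_ring(struct sk_buff *skb)
{
	return skb->protocol & (MY_NUM_TX_RINGS - 1);	/* placeholder hash */
}

/* sketch: runs without the tx lock; only per-skb computation happens here */
static int my_hard_prep_xmit(struct sk_buff *skb, struct net_device *dev)
{
	skb->queue_mapping = my_select_ring(skb);
	return 0;
}
---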

I lost track of how/where things went since the last discussion, so I
need to wrap my mind around it to make sensible suggestions - I know
the core patches are in the kernel but haven't paid attention to details,
and if you look at my second patch you'd see a comment in
dev_batch_xmit() which says I need to scrutinize multiqueue more.

cheers,
jamal



[ofa-general] [DOC] Net batching driver howto

2007-09-24 Thread jamal

I have updated the driver howto to match the patches i posted yesterday.
attached. 

cheers,
jamal
Here's the beginning of a howto for driver authors.

The intended audience for this howto is people already
familiar with netdevices.

1.0  Netdevice Pre-requisites
--

For hardware based netdevices, you must have at least hardware that
is capable of doing DMA with many descriptors; i.e. having hardware
with a queue length of 3 (as in some fscked ethernet hardware) is
not very useful in this case.

2.0  What is new in the driver API
---

There are 3 new methods and one new variable introduced. These are:
1) dev->hard_prep_xmit()
2) dev->hard_end_xmit()
3) dev->hard_batch_xmit()
4) dev->xmit_win
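
For reference, here is a compilable sketch of roughly what these additions
amount to. Field types and return values are my reading of the patches, not
a verbatim copy of netdevice.h (for instance, the void hard_end_xmit is
inferred from tg3 using the void tg3_kick_DMA() there):

---
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Sketch only: the batching additions as they sit inside struct
 * net_device in the patches; exact return types are assumptions. */
struct batching_additions_sketch {
	struct sk_buff_head blist;    /* batch of skbs the core hands the driver */
	int xmit_win;                 /* descriptors the driver can take right now */
	int  (*hard_prep_xmit)(struct sk_buff *skb, struct net_device *dev);
	void (*hard_end_xmit)(struct net_device *dev);
	int  (*hard_batch_xmit)(struct net_device *dev);  /* LLTX drivers only */
};
---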

2.1 Using Core driver changes
-

To provide context, let's look at a typical driver abstraction
for dev->hard_start_xmit(). It has 4 parts:
a) packet formatting (example: vlan, mss, descriptor counting etc)
b) chip specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete the packet transmit, tell the DMA engine to
chew on, tx completion interrupts etc

[For code cleanliness/readability sake, regardless of this work,
one should break dev->hard_start_xmit() into those 4 functions
anyways].
A driver which has all 4 parts and needs to support batching is
advised to split its dev->hard_start_xmit() in the following manner:
1) use its dev->hard_prep_xmit() method to achieve #a
2) use its dev->hard_end_xmit() method to achieve #d
3) #b and #c can stay in ->hard_start_xmit() (or whichever way you
want to do this)
Note: There are drivers which may not need to support either of the two
methods (example: the tun driver I patched), so the two methods are
essentially optional.
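
As a purely illustrative sketch (the mydrv_*() names are made up, not from
any posted driver), a non-LLTX driver split along those lines ends up with
this shape:

---
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* #a: packet formatting (vlan, mss, descriptor counting) - no tx lock held */
static int mydrv_hard_prep_xmit(struct sk_buff *skb, struct net_device *dev)
{
	/* e.g. count the descriptors this skb needs and stash it in skb->cb[] */
	return 0;
}

/* #b + #c: chip specific formatting and enqueue onto the DMA ring;
 * called under the tx lock, once per packet in the batch */
static int mydrv_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	return NETDEV_TX_OK;
}

/* #d: one IO/doorbell operation for the whole batch, still under the tx lock */
static void mydrv_hard_end_xmit(struct net_device *dev)
{
	/* e.g. write the tail pointer, do tx completion housekeeping */
}
---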

2.1.1 Theory of operation
--

The core will first do the packet formatting by invoking your
supplied dev->hard_prep_xmit() method. It will then pass you the packets
via your dev->hard_start_xmit() method, for as many packets as you
have advertised (via dev->xmit_win) that you can consume. Lastly it will
invoke your dev->hard_end_xmit() when it completes passing you all the
packets queued for you.
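
To make the sequence concrete, here is a simplified sketch of that order of
calls from the core's point of view (the real logic lives in the core
batching patches; error handling and the xmit_win accounting are elided, and
it assumes the driver registered both optional methods):

---
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* simplified sketch of the calling order described above */
static void core_send_batch_sketch(struct net_device *dev,
				   struct sk_buff_head *blist)
{
	struct sk_buff *skb;

	/* step #a for every packet, without the tx lock */
	skb_queue_walk(blist, skb)
		dev->hard_prep_xmit(skb, dev);

	netif_tx_lock(dev);
	/* at most dev->xmit_win worth of packets were queued on blist */
	while ((skb = __skb_dequeue(blist)) != NULL)
		dev->hard_start_xmit(skb, dev);
	dev->hard_end_xmit(dev);	/* step #d once for the whole batch */
	netif_tx_unlock(dev);
}
---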


2.1.1.1 Locking rules
-

dev->hard_prep_xmit() is invoked without holding any
tx lock but the rest are under TX_LOCK(). So you have to ensure that
whatever you put in dev->hard_prep_xmit() doesn't require locking.

2.1.1.2 The slippery LLTX
-

LLTX drivers present a challenge in that we have to introduce a deviation
from the norm and require the ->hard_batch_xmit() method. An LLTX
driver presents us with ->hard_batch_xmit(), to which we pass a list
of packets in the dev->blist skb queue. It is then the responsibility
of ->hard_batch_xmit() to exercise steps #b and #c for all packets
passed in dev->blist.
Steps #a and #d are done by the core, should you register the presence of
dev->hard_prep_xmit() and dev->hard_end_xmit() in your setup.
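
A hypothetical ->hard_batch_xmit() for such an LLTX driver might look
roughly like this (the mydrv_*() names, the private lock and the helper are
assumptions made for the sketch):

---
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/spinlock.h>

struct mydrv_priv {
	spinlock_t tx_lock;		/* driver-private lock (LLTX) */
	/* ... ring state ... */
};

/* assumed helper: steps #b + #c for one packet */
static void mydrv_queue_on_ring(struct mydrv_priv *p, struct sk_buff *skb) { }

static int mydrv_hard_batch_xmit(struct net_device *dev)
{
	struct mydrv_priv *priv = netdev_priv(dev);
	struct sk_buff *skb;

	spin_lock(&priv->tx_lock);
	/* #b and #c for every packet the core queued on dev->blist;
	 * #a was already done by the core via ->hard_prep_xmit(), and
	 * #d is done by the core via ->hard_end_xmit() if registered */
	while ((skb = __skb_dequeue(&dev->blist)) != NULL)
		mydrv_queue_on_ring(priv, skb);
	spin_unlock(&priv->tx_lock);

	return 0;
}
---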

2.1.1.3 xmit_win


The dev->xmit_win variable is set by the driver to tell us how
much space it has in its rings/queues. dev->xmit_win is introduced to
ensure that when we pass the driver a list of packets it will swallow
all of them - which is useful because we don't requeue to the qdisc (and
avoid burning unnecessary cpu cycles or introducing any strange
re-ordering). The driver tells us, whenever it invokes netif_wake_queue,
how much space it has for descriptors by setting this variable.

3.0 Driver Essentials
-

The typical driver tx state machine is:


-1- +Core sends packets
 +-- Driver puts packet onto hardware queue
 +if hardware queue is full, netif_stop_queue(dev)
 +
-2- +core stops sending because of netif_stop_queue(dev)
..
.. time passes ...
..
-3- +--- driver has transmitted packets, opens up tx path by
  invoking netif_wake_queue(dev)
-1- +Cycle repeats and core sends more packets (step 1).


3.1  Driver pre-requisite
--

This is _a very important_ requirement in making batching useful.
The pre-requisite for batching changes is that the driver should 
provide a low threshold to open up the tx path.
Drivers such as tg3 and e1000 already do this.
Before you invoke netif_wake_queue(dev), you check whether a
threshold of free space has been reached for inserting new packets.

Here's an example of how I added it to the tun driver. Observe the
setting of dev->xmit_win.

---
+#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
..
..
u32 t = skb_queue_len(&tun->readq);
if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
	tun->dev->xmit_win = tun->dev->tx_queue_len;
	netif_wake_queue(tun->dev);
}
---

Here's how the batching e1000 driver does it:

--
if (unlikely(cleaned && netif_carrier_ok(netdev) &&
	     E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD)) {

	if (netif_queue_stopped(netdev)) {
		int rspace = E1000_DESC_UNUSED(tx_ring

[ofa-general] RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock

2007-09-24 Thread jamal
On Mon, 2007-24-09 at 15:57 -0700, Waskiewicz Jr, Peter P wrote:

 I've looked at that as a candidate to use.  The lock for enqueue would
 be needed when actually placing the skb into the appropriate software
 queue for the qdisc, so it'd be quick.

The enqueue is easy to comprehend. The single device queue lock should
suffice. The dequeue is interesting:
Maybe you can point me to some doc or describe the dequeue aspect to me;
are you planning to have an array of tx locks, one per ring?
What is the policy for how the qdisc queues are locked/mapped to tx rings?

cheers,
jamal



[ofa-general] [PATCHES] TX batching

2007-09-23 Thread jamal

I had plenty of time this weekend so i have been doing a _lot_ of
testing.  My next emails will send a set of patches:
 
Patch 1: Introduces explicit tx locking
Patch 2: Introduces batching interface
Patch 3: Core uses batching interface
Patch 4: get rid of dev-gso_skb

Testing
---
Each of these patches has been performance tested and the results
are in the logs on a per-patch basis. 
My system under test hardware is a 2xdual core opteron with a couple of 
tg3s. 
My test tool generates udp traffic of different sizes for up to 60
seconds per run or a total of 30M packets. I have 4 threads, each
running on a specific CPU, which keep all the CPUs as busy as they can
sending packets targeted at a directly connected box's udp discard
port.

All 4 CPUs target a single tg3 to send. The receiving box has a tc rule 
which counts and drops all incoming udp packets to discard port - this
allows me to make sure that the receiver is not the bottleneck in the
testing. Packet sizes sent are {64B, 128B, 256B, 512B, 1024B}. Each
packet size run is repeated 10 times to ensure that there are no
transients. The average of all 10 runs is then computed and collected.

I have not run testing on patch #4 because I had to let the machine
go, but I will have some access to it tomorrow early morning when I can
run some tests.

Comments

I am trying to kill ->hard_batch_xmit(), but it would be tricky to do
without it for LLTX drivers. Anything I try will require a few extra
checks. OTOH, I could kill LLTX for the drivers I am using that
are LLTX and then drop that interface, or I could say no support
for LLTX. I am in a dilemma.

Dave, please let me know if this meets your desire to let devices
which are SG and able to compute CSUM benefit, just in case I
misunderstood.
Herbert, if you can look at at least patch 4 I will appreciate it.

More patches to follow - I didn't want to overload people by dumping
too many patches. Most of the patches below are ready to go; some
need some testing and others need a little porting from an earlier
kernel:
- tg3 driver (tested and works well, but I don't want to send 
- tun driver
- pktgen
- netiron driver
- e1000 driver
- ethtool interface
- There is at least one other driver promised to me

I am also going to update the two documents i posted earlier.
Hopefully i can do that today.

cheers,
jamal



[ofa-general] [PATCH 1/4] [NET_SCHED] explict hold dev tx lock

2007-09-23 Thread jamal

I have submitted this before; but here it is again.
Against net-2.6.24 from yesterday for this and all following patches. 


cheers,
jamal

[NET_SCHED] explict hold dev tx lock

For N cpus, with full throttle traffic on all N CPUs funneling traffic
to the same ethernet device, the device's queue lock is contended by all
N CPUs constantly. The TX lock is only contended by a max of 2 CPUs.
In the current mode of qdisc operation, when all N CPUs contend for
the dequeue region and one of them (after all the work) enters the
dequeue region, we may end up aborting the path if we are unable to get
the tx lock, and go back to contend for the queue lock. As N goes up,
this gets worse.

The changes in this patch result in a small increase in performance
with a 4CPU (2xdual-core) with no irq binding. My tests are UDP
based and keep all 4CPUs busy all the time for the period of the test.
Both e1000 and tg3 showed similar behavior.
I expect higher gains with more CPUs. Summary below with different
UDP packet sizes and the resulting pps seen. Note that at around 200 bytes
the two don't seem that much different and we are approaching wire
speed (with plenty of CPU available; e.g. at 512B, the app is sitting
at 80% idle in both cases).

+--------+--------+--------+--------+--------+--------+
|pktsize |  64B   |  128B  |  256B  |  512B  | 1024B  |
+--------+--------+--------+--------+--------+--------+
|Original| 467482 | 463061 | 388267 | 216308 | 114704 |
|txlock  | 468922 | 464060 | 388298 | 216316 | 114709 |
+--------+--------+--------+--------+--------+--------+

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit b0e36991c5850dfe930f80ee508b08fdcabc18d1
tree b1787bba26f80a325298f89d1ec882cc5ab524ae
parent 42765047105fdd496976bc1784d22eec1cd9b9aa
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 23 Sep 2007 09:09:17 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 23 Sep 2007 09:09:17 -0400

 net/sched/sch_generic.c |   19 ++-
 1 files changed, 2 insertions(+), 17 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index e970e8e..95ae119 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -134,34 +134,19 @@ static inline int qdisc_restart(struct net_device *dev)
 {
 	struct Qdisc *q = dev->qdisc;
struct sk_buff *skb;
-   unsigned lockless;
int ret;
 
/* Dequeue packet */
if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
return 0;
 
-   /*
-* When the driver has LLTX set, it does its own locking in
-* start_xmit. These checks are worth it because even uncongested
-* locks can be quite expensive. The driver can do a trylock, as
-* is being done here; in case of lock contention it should return
-* NETDEV_TX_LOCKED and the packet will be requeued.
-*/
-	lockless = (dev->features & NETIF_F_LLTX);
-
-	if (!lockless && !netif_tx_trylock(dev)) {
-   /* Another CPU grabbed the driver tx lock */
-   return handle_dev_cpu_collision(skb, dev, q);
-   }
 
/* And release queue */
 	spin_unlock(&dev->queue_lock);
 
+   HARD_TX_LOCK(dev, smp_processor_id());
ret = dev_hard_start_xmit(skb, dev);
-
-   if (!lockless)
-   netif_tx_unlock(dev);
+   HARD_TX_UNLOCK(dev);
 
 	spin_lock(&dev->queue_lock);
 	q = dev->qdisc;

[ofa-general] [PATCH 2/4] [NET_BATCH] Introduce batching interface

2007-09-23 Thread jamal

This patch introduces the netdevice interface for batching.

cheers,
jamal

[NET_BATCH] Introduce batching interface

This patch introduces the netdevice interface for batching.

A typical driver dev->hard_start_xmit() has 4 parts:
a) packet formatting (example: vlan, mss, descriptor counting etc)
b) chip specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete the packet transmit, tell the DMA engine to chew on,
tx completion interrupts etc

[For code cleanliness/readability sake, regardless of this work,
one should break the dev-hard_start_xmit() into those 4 functions
anyways].

With the api introduced in this patch, a driver which has all
4 parts and needs to support batching is advised to split its
dev->hard_start_xmit() in the following manner:
1) use its dev->hard_prep_xmit() method to achieve #a
2) use its dev->hard_end_xmit() method to achieve #d
3) #b and #c can stay in ->hard_start_xmit() (or whichever way you want
to do this)
Note: There are drivers which may not need to support either of the two
methods (example: the tun driver I patched), so the two methods are
optional.

The core will first do the packet formatting by invoking your supplied
dev->hard_prep_xmit() method. It will then pass you the packets via
your dev->hard_start_xmit() method, and lastly will invoke your
dev->hard_end_xmit() when it completes passing you all the packets
queued for you. dev->hard_prep_xmit() is invoked without holding any
tx lock, but the rest are under TX_LOCK().

LLTX drivers present a challenge in that we have to introduce a deviation
from the norm and introduce the ->hard_batch_xmit() method. An LLTX
driver presents us with ->hard_batch_xmit(), to which we pass a list
of packets in the dev->blist skb queue. It is then the responsibility
of ->hard_batch_xmit() to exercise steps #b and #c for all packets,
and #d when the batching is complete. Step #a is already done for you
by the time you get the packets in dev->blist.
Lastly, the xmit_win variable is introduced to ensure that when we pass
the driver a list of packets it will swallow all of them - which is
useful because we don't requeue to the qdisc (and avoid burning
unnecessary cpu cycles or introducing any strange re-ordering). The driver
tells us, when it invokes netif_wake_queue, how much space it has for
descriptors by setting this variable.
Some decisions i had to make:
- every driver will have an xmit_win variable and the core will set it
to 1, which means the behavior of non-batching drivers stays the same.
- the batch list, blist, is no longer a pointer; this wastes a little extra
memory which I plan to recoup by killing gso_skb in later patches.
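
As a usage sketch (not part of this patch - where exactly a driver sets
NETIF_F_BTX and initialises xmit_win is up to each driver conversion, and
the method signatures are assumed to mirror the tg3 conversion), wiring a
non-LLTX driver into this interface looks roughly like:

---
#include <linux/netdevice.h>
#include <linux/skbuff.h>

#define MYDRV_TX_WAKE_THRESH 32		/* assumed low-water mark */

/* stubs standing in for the driver's split-out methods */
static int mydrv_hard_prep_xmit(struct sk_buff *skb, struct net_device *dev)
{
	return 0;
}
static int mydrv_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	return NETDEV_TX_OK;
}
static void mydrv_hard_end_xmit(struct net_device *dev)
{
}

/* probe/setup time: advertise batching and register the new methods */
static void mydrv_setup_batching(struct net_device *dev)
{
	dev->features       |= NETIF_F_BTX;
	dev->hard_prep_xmit  = mydrv_hard_prep_xmit;
	dev->hard_start_xmit = mydrv_hard_start_xmit;
	dev->hard_end_xmit   = mydrv_hard_end_xmit;
	dev->xmit_win        = 1;	/* the core default until the ring is sized */
}

/* tx completion: re-open the tx window before waking the queue */
static void mydrv_tx_done(struct net_device *dev, int free_descriptors)
{
	if (netif_queue_stopped(dev) &&
	    free_descriptors >= MYDRV_TX_WAKE_THRESH) {
		dev->xmit_win = free_descriptors;
		netif_wake_queue(dev);
	}
}
---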

There's a lot of history and reasoning on why batching, in a document
I am writing which I may submit as a patch.
Thomas Graf (who probably doesn't know this) gave me the impetus to
start looking at this back in 2004 when he invited me to the linux
conference he was organizing. Parts of what I presented at SUCON in
2004 talk about batching. Herbert Xu forced me to take a second look around
2.6.18 - refer to my netconf 2006 presentation. Krishna Kumar provided
me with more motivation in May 2007 when he posted on netdev and engaged me.
Sridhar Samudrala, Krishna Kumar, Matt Carlson, Michael Chan,
Jeremy Ethridge, Evgeniy Polyakov, Sivakumar Subramani, and
David Miller, have contributed in one or more of {bug fixes, enhancements,
testing, lively discussion}. The Broadcom and netiron folks have been
outstanding in their help.

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit ab4b07ef2e4069c115c9c1707d86ae2344a5ded5
tree 994b42b03bbfcc09ac8b7670c53c12e0b2a71dc7
parent b0e36991c5850dfe930f80ee508b08fdcabc18d1
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 23 Sep 2007 10:30:32 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 23 Sep 2007 10:30:32 -0400

 include/linux/netdevice.h |   17 +++
 net/core/dev.c|  106 +
 2 files changed, 123 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cf89ce6..443cded 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -453,6 +453,7 @@ struct net_device
 #define NETIF_F_NETNS_LOCAL	8192	/* Does not change network namespaces */
 #define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 #define NETIF_F_LRO		32768	/* large receive offload */
+#define NETIF_F_BTX		65536	/* Capable of batch tx */
 
 /* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -578,6 +579,15 @@ struct net_device
 	void			*priv;	/* pointer to private data	*/
 	int			(*hard_start_xmit) (struct sk_buff *skb,
 						    struct net_device *dev);
+	/* hard_batch_xmit is needed for LLTX, kill it when those
+	 * disappear or better kill it now and dont support LLTX
+	 */
+	int

[ofa-general] [PATCH 3/4][NET_BATCH] net core use batching

2007-09-23 Thread jamal
This patch adds the usage of batching within the core.

cheers,
jamal

[NET_BATCH] net core use batching

This patch adds the usage of batching within the core.
The same test methodology used in introducing txlock is used, with
the following results on different kernels:

+--------+--------+--------+--------+--------+--------+
|pktsize |  64B   |  128B  |  256B  |  512B  | 1024B  |
+--------+--------+--------+--------+--------+--------+
|Original| 467482 | 463061 | 388267 | 216308 | 114704 |
|txlock  | 468922 | 464060 | 388298 | 216316 | 114709 |
|tg3nobtx| 468012 | 464079 | 388293 | 216314 | 114704 |
|tg3btxdr| 480794 | 475102 | 388298 | 216316 | 114705 |
|tg3btxco| 481059 | 475423 | 388285 | 216308 | 114706 |
+--------+--------+--------+--------+--------+--------+

The first two colums Original and txlock were introduced in an earlier
patch and demonstrate a slight increase in performance with txlock.
tg3nobtx shows the tg3 driver with no changes to support batching.
The purpose of this test is to demonstrate the effect of introducing
the core changes to a driver that doesnt support them.
Although this patch brings down perfomance slightly compared to txlock
for such netdevices, it is still better compared to just the original kernel.
tg3btxdr demonstrates the effect of using -hard_batch_xmit() with tg3
driver. tg3btxco demonstrates the effect of letting the core do all the
work. As can be seen the last two are not very different in performance.
The difference is -hard_batch_xmit() introduces a new method which
is intrusive.

I have #if-0ed some of the old functions so the patch is more readable.

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit e26705f6ef7db034df7af3f4fccd7cd40b8e46e0
tree b99c469497a0145ca5c0651dc4229ce17da5b31c
parent 6b8e2f76f86c35a6b2cee3698c633d20495ae0c0
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 23 Sep 2007 11:35:25 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 23 Sep 2007 11:35:25 -0400

 net/sched/sch_generic.c |  127 +++
 1 files changed, 115 insertions(+), 12 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 95ae119..86a3f9d 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -56,6 +56,7 @@ static inline int qdisc_qlen(struct Qdisc *q)
 	return q->q.qlen;
 }
 
+#if 0
 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
  struct Qdisc *q)
 {
@@ -110,6 +111,97 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
 
return ret;
 }
+#endif
+
+static inline int handle_dev_cpu_collision(struct net_device *dev)
+{
+	if (unlikely(dev->xmit_lock_owner == smp_processor_id())) {
+		if (net_ratelimit())
+			printk(KERN_WARNING
+			       "Dead loop on netdevice %s, fix it urgently!\n",
+			       dev->name);
+   return 1;
+   }
+   __get_cpu_var(netdev_rx_stat).cpu_collision++;
+   return 0;
+}
+
+static inline int
+dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev,
+  struct Qdisc *q)
+{
+
+   struct sk_buff *skb;
+
+   while ((skb = __skb_dequeue(skbs)) != NULL)
+		q->ops->requeue(skb, q);
+
+   netif_schedule(dev);
+   return 0;
+}
+
+static inline int
+xmit_islocked(struct sk_buff_head *skbs, struct net_device *dev,
+   struct Qdisc *q)
+{
+   int ret = handle_dev_cpu_collision(dev);
+
+   if (ret) {
+   if (!skb_queue_empty(skbs))
+   skb_queue_purge(skbs);
+   return qdisc_qlen(q);
+   }
+
+   return dev_requeue_skbs(skbs, dev, q);
+}
+
+static int xmit_count_skbs(struct sk_buff *skb)
+{
+   int count = 0;
+	for (; skb; skb = skb->next) {
+		count += skb_shinfo(skb)->nr_frags;
+   count += 1;
+   }
+   return count;
+}
+
+static int xmit_get_pkts(struct net_device *dev,
+  struct Qdisc *q,
+  struct sk_buff_head *pktlist)
+{
+   struct sk_buff *skb;
+	int count = dev->xmit_win;
+
+	if (count && dev->gso_skb) {
+		skb = dev->gso_skb;
+		dev->gso_skb = NULL;
+   count -= xmit_count_skbs(skb);
+   __skb_queue_tail(pktlist, skb);
+   }
+
+	while (count > 0) {
+		skb = q->dequeue(q);
+   if (!skb)
+   break

[ofa-general] [PATCH 4/4][NET_SCHED] kill dev-gso_skb

2007-09-23 Thread jamal

This patch removes dev-gso_skb as it is no longer necessary with
batching code.

cheers,
jamal

[NET_SCHED] kill dev-gso_skb
The batching code does what gso used to batch at the drivers.
There is no more need for gso_skb. If for whatever reason the
requeueing is a bad idea we are going to leave packets in dev-blist
(and still not need dev-gso_skb)

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit c6d2d61a73e1df5daaa294876f62454413fcb0af
tree 1d7bf650096a922a6b6a4e7d6810f83320eb94dd
parent e26705f6ef7db034df7af3f4fccd7cd40b8e46e0
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 23 Sep 2007 12:25:10 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 23 Sep 2007 12:25:10 -0400

 include/linux/netdevice.h |3 ---
 net/sched/sch_generic.c   |   12 
 2 files changed, 0 insertions(+), 15 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 443cded..7811729 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -560,9 +560,6 @@ struct net_device
struct list_headqdisc_list;
 	unsigned long		tx_queue_len;	/* Max frames per queue allowed */
 
-   /* Partially transmitted GSO packet. */
-   struct sk_buff  *gso_skb;
-
/* ingress path synchronizer */
spinlock_t  ingress_lock;
struct Qdisc*qdisc_ingress;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 86a3f9d..b4e1607 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -172,13 +172,6 @@ static int xmit_get_pkts(struct net_device *dev,
struct sk_buff *skb;
 	int count = dev->xmit_win;
 
-	if (count && dev->gso_skb) {
-		skb = dev->gso_skb;
-		dev->gso_skb = NULL;
-		count -= xmit_count_skbs(skb);
-		__skb_queue_tail(pktlist, skb);
-   }
-
 	while (count > 0) {
skb = q-dequeue(q);
if (!skb)
@@ -654,7 +647,6 @@ void dev_activate(struct net_device *dev)
 void dev_deactivate(struct net_device *dev)
 {
struct Qdisc *qdisc;
-   struct sk_buff *skb;
 
 	spin_lock_bh(&dev->queue_lock);
 	qdisc = dev->qdisc;
@@ -662,15 +654,11 @@ void dev_deactivate(struct net_device *dev)
 
qdisc_reset(qdisc);
 
-	skb = dev->gso_skb;
-	dev->gso_skb = NULL;
 	if (!skb_queue_empty(&dev->blist))
 		skb_queue_purge(&dev->blist);
 	dev->xmit_win = 1;
 	spin_unlock_bh(&dev->queue_lock);
 
-   kfree_skb(skb);
-
dev_watchdog_down(dev);
 
/* Wait for outstanding dev_queue_xmit calls. */

[ofa-general] Re: [PATCHES] TX batching

2007-09-23 Thread jamal
On Sun, 2007-23-09 at 12:36 -0700, Kok, Auke wrote:

 please be reminded that we're going to strip down e1000 and most of the 
 features
 should go into e1000e, which has much less hardware workarounds. I'm still
 reluctant to putting in new stuff in e1000 - I really want to chop it down 
 first ;)

sure - the question then is, will you take those changes if i use
e1000e? theres a few cleanups that have nothing to do with batching;
take a look at the modified e1000 on the git tree.

cheers,
jamal



[ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000

2007-09-17 Thread jamal
On Sun, 2007-16-09 at 20:13 -0700, David Miller wrote:

 What Herbert and I want to do is basically turn on TSO for
 devices that can't do it in hardware, and rely upon the GSO
 framework to do the segmenting in software right before we
 hit the device.

Sensible. 

 This only makes sense for devices which can 1) scatter-gather
 and 2) checksum on transmit.  

If you know there are enough descriptors in the driver to
cover all the skbs you are passing, do you need to have #1?
Note I don't touch fragments; I am assuming the driver is smart enough to
handle them, otherwise it won't advertise that it can handle scatter-gather.

 Otherwise we make too many copies and/or passes over the data.

I didn't understand this last bit - you are still going to go over the
list regardless of whether you call ->hard_start_xmit() once or
multiple times over the same list, no? In the latter case I am assuming
a trimmed down ->hard_start_xmit().

 UDP is too easy a test case in fact :-)

I learnt a lot about the behavior out of doing udp (and before that with
pktgen); theres a lot of driver habits that may need to be tuned before
batching becomes really effective - which is easier to see with udp than
with tcp.

cheers,
jamal



[ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000

2007-09-16 Thread jamal
On Sun, 2007-16-09 at 18:02 -0700, David Miller wrote:

 I do.
 
 And I'm reviewing and applying several hundred patches a day.
 
 What's the point? :-)

Reading the commentary made me think you were about to swallow that with
one more change by the time i wake up;-
I still think this work - despite my vested interest - needs more
scrutiny from a performance perspective.
I tend to send a url to my work, but it may be time to start posting
patches.

cheers,
jamal



[ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000

2007-09-16 Thread jamal
On Sun, 2007-16-09 at 19:25 -0700, David Miller wrote:

 There are tertiary issues I'm personally interested in, for example
 how well this stuff works when we enable software GSO on a non-TSO
 capable card.
 
 In such a case the GSO segment should be split right before we hit the
 driver and then all the sub-segments of the original GSO frame batched
 in one shot down to the device driver.

I think GSO is still useful on top of this.
In my patches anything with gso gets put into the batch list and shot
down to the driver. I've never considered checking whether the nic is TSO
capable; that may be something worth checking into. The netiron allows
you to shove up to 128 skbs utilizing one tx descriptor, which makes for
interesting possibilities.

 In this way you'll get a large chunk of the benefit of TSO without
 explicit hardware support for the feature.
 
 There are several cards (some even 10GB) that will benefit immensely
 from this.

indeed - ive always wondered if batching this way would make the NICs
behave differently from the way TSO does.

On a side note: My observation is that with large packets on a very busy
system; bulk transfer type app, one approaches wire speed; with or
without batching, the apps are mostly idling (Ive seen upto 90% idle
time polling at the socket level for write to complete with a really
busy system). This is the case with or without batching. cpu seems a
little better with batching. As the aggregation of the apps gets more
aggressive (achievable by reducing their packet sizes), one can achieve
improved throughput and reduced cpu utilization. This all with UDP; i am
still studying tcp. 
 
cheers,
jamal



[ofa-general] TSO, TCP Cong control etc

2007-09-14 Thread jamal
Ive changed the subject to match content..

On Fri, 2007-14-09 at 03:20 -0400, Bill Fink wrote:
 On Mon, 27 Aug 2007, jamal wrote:
 
  Bill:
  who suggested (as per your email) the 75usec value and what was it based
  on measurement-wise? 
 
 Belatedly getting back to this thread.  There was a recent myri10ge
 patch that changed the default value for tx/rx interrupt coalescing
 to 75 usec claiming it was an optimum value for maximum throughput
 (and is also mentioned in their external README documentation).

I would think such a value would be very specific to the ring size and
maybe even the machine in use. 

 I also did some empirical testing to determine the effect of different
 values of TX/RX interrupt coalescing on 10-GigE network performance,
 both with TSO enabled and with TSO disabled.  The actual test runs
 are attached at the end of this message, but the results are summarized
 in the following table (network performance in Mbps).
 
                 TX/RX interrupt coalescing in usec (both sides)
                    0     15     30     45     60     75     90    105
 
  TSO enabled    8909   9682   9716   9725   9739   9745   9688   9648
  TSO disabled   9113   9910   9910   9910   9910   9910   9910   9910

 TSO disabled performance is always better than equivalent TSO enabled
 performance.  With TSO enabled, the optimum performance is indeed at
 a TX/RX interrupt coalescing value of 75 usec.  With TSO disabled,
 performance is the full 10-GigE line rate of 9910 Mbps for any value
 of TX/RX interrupt coalescing from 15 usec to 105 usec.

Interesting results. I think J Heffner made a very compelling
description the other day, based on your netstat results at the receiver,
as to what is going on (refer to the comments on stretch ACKs). If the
receiver is fixed, then you'd see better numbers from TSO.

The 75 microsecs is very benchmarky in my opinion. If I were to pick a
different app or a different NIC, or run on many cpus with many apps doing
TSO, I highly doubt that would be the right number.


 Here's a retest (5 tests each):
 
 TSO enabled:
 
 TCP Cubic (initial_ssthresh set to 0):
[..]

 TCP Bic (initial_ssthresh set to 0):
[..]
 
 TCP Reno:
 
[..]
 TSO disabled:
 
 TCP Cubic (initial_ssthresh set to 0):
 
[..]
 TCP Bic (initial_ssthresh set to 0):
 
[..]
 TCP Reno:
 
[..]
 Not too much variation here, and not quite as high results
 as previously.  

BIC seems to be on average better, followed by CUBIC, followed by Reno.
The difference this time may be because you set the ssthresh to 0
(hopefully every run), and so Reno is definitely going to perform
worse since it is a lot less aggressive in comparison to the other two.

 Some further testing reveals that while this
 time I mainly get results like (here for TCP Bic with TSO
 disabled):
 
 [EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
  4958.0625 MB /  10.02 sec = 4148.9361 Mbps 100 %TX 99 %RX
 
 I also sometimes get results like:
 
 [EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
  5882.1875 MB /  10.00 sec = 4932.5549 Mbps 100 %TX 90 %RX
 

not good.

 The higher performing results seem to correspond to when there's a
 somewhat lower receiver CPU utilization.  I'm not sure but there
 could also have been an effect from running the -M1460 test after
 the 9000 byte jumbo frame test (no jumbo tests were done at all prior
 to running the above sets of 5 tests, although I did always discard
 an initial warmup test, and now that I think about it some of
 those initial discarded warmup tests did have somewhat anomalously
 high results).

If you didnt reset the ssthresh on every run, could it have been cached
and used on subsequent runs?

  A side note: Although the experimentation reduces the variables (eg
  tying all to CPU0), it would be more exciting to see multi-cpu and
  multi-flow sender effect (which IMO is more real world). 
 
 These systems are intended as test systems for 10-GigE networks,
 and as such it's important to get as consistently close to full
 10-GigE line rate as possible, and that's why the interrupts and
 nuttcp application are tied to CPU0, with almost all other system
 applications tied to CPU1.

Sure, good benchmark. You get to know how well you can do.

 Now on another system that's intended as a 10-GigE firewall system,
 it has 2 Myricom 10-GigE NICs with the interrupts for eth2 tied to
 CPU0 and the interrupts for CPU1 tied to CPU1.  In IP forwarding
 tests of this system, I have basically achieved full bidirectional
 10-GigE line rate IP forwarding with 9000 byte jumbo frames.

In forwarding, a more meaningful metric would be pps. The cost per packet
tends to dominate the results over the cost/byte.
9K jumbo frames at 10G is less than 500Kpps (10^10 bit/s divided by
9000 x 8 bits per frame is roughly 139Kpps) - so I don't see the machine
you are using sweating at all. To give you a comparison, on a lower end
opteron a single CPU can generate 1Mpps with batching pktgen; Robert
says he can do that even without batching on an opteron closer to what
you

[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-27 Thread jamal
On Sun, 2007-26-08 at 19:04 -0700, David Miller wrote:

 The transfer is much better behaved if we ACK every two full sized
 frames we copy into the receiver, and therefore don't stretch ACK, but
 at the cost of cpu utilization.

The rx coalescing in theory should help by accumulating more ACKs on the
rx side of the sender. But it doesn't seem to do that; i.e. for the 9K MTU,
you are better off turning off the coalescing if you want higher
numbers. Also some of the TOE vendors (chelsio?) claim to have fixed
this by reducing bursts on outgoing packets.
 
Bill:
who suggested (as per your email) the 75usec value and what was it based
on measurement-wise? 
BTW, thanks for the finding the energy to run those tests and a very
refreshing perspective. I dont mean to add more work, but i had some
queries;
On your earlier tests, i think that Reno showed some significant
differences on the lower MTU case over BIC. I wonder if this is
consistent? 
A side note: Although the experimentation reduces the variables (eg
tying all to CPU0), it would be more exciting to see multi-cpu and
multi-flow sender effect (which IMO is more real world). 
Last note: you need a newer netstat.

 These effects are particularly pronounced on systems where the
 bus bandwidth is also one of the limiting factors.

Can you elucidate this a little more Dave? Did you mean memory
bandwidth? 

cheers,
jamal



[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-24 Thread jamal
On Thu, 2007-23-08 at 23:18 -0400, Bill Fink wrote:

[..]
 Here you can see there is a major difference in the TX CPU utilization
 (99 % with TSO disabled versus only 39 % with TSO enabled), although
 the TSO disabled case was able to squeeze out a little extra performance
 from its extra CPU utilization.  

Good stuff. What kind of machine? SMP?
Seems the receive side of the sender is also consuming a lot more cpu,
I suspect because the receiver is generating a lot more ACKs with TSO.
Does the choice of the tcp congestion control algorithm affect results?
It would be interesting to see both MTUs with either TCP BIC vs good old
reno on the sender (probably without changing what the receiver does). BIC
seems to be the default lately.

 Interestingly, with TSO enabled, the
 receiver actually consumed more CPU than with TSO disabled, 

I would suspect the fact that a lot more packets making it into the
receiver for TSO contributes.

 so I guess
 the receiver CPU saturation in that case (99 %) was what restricted
 its performance somewhat (this was consistent across a few test runs).

Unfortunately the receiver plays a big role in such tests - if it is
bottlenecked then you are not really testing the limits of the
transmitter. 

cheers,
jamal



[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-23 Thread jamal
On Wed, 2007-22-08 at 13:21 -0700, David Miller wrote:
 From: Rick Jones [EMAIL PROTECTED]
 Date: Wed, 22 Aug 2007 10:09:37 -0700
 
  Should it be any more or less worrysome than small packet
  performance (eg the TCP_RR stuff I posted recently) being rather
  worse with TSO enabled than with it disabled?
 
 That, like any such thing shown by the batching changes, is a bug
 to fix.

Possibly a bug - but you really should turn off TSO if you are doing
huge interactive transactions (which is fair because there is a clear
demarcation).
The litmus test is the same as for any change that is supposed to improve
net performance - it has to demonstrate that it is not intrusive and that it
improves performance (consistently). The standard metrics are
{throughput, cpu-utilization, latency}; i.e. as long as one improves and
the others remain unchanged, it would make sense. Yes, I am religious about
batching after all the invested sweat (and I continue to work on it,
hoping to demystify it) - the theory makes a lot of sense.
 
cheers,
jamal




[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-23 Thread jamal
On Thu, 2007-23-08 at 18:04 -0400, jamal wrote:

 The litmus test is the same as any change that is supposed to improve
 net performance - it has to demonstrate it is not intrusive and that it
 improves (consistently) performance. The standard metrics are
 {throughput, cpu-utilization, latency} i.e as long as one improves and
 others remain zero, it would make sense. Yes, i am religious for
 batching after all the invested sweat (and i continue to work on it
 hoping to demystify) - the theory makes a lot of sense.

Before someone jumps and strangles me ;-> By litmus test I meant as
applied to batching. [TSO already passed - iirc, it has been
demonstrated to really not add much to throughput (can't improve much
given how close we already are to wire speed) but to improve CPU utilization].

cheers,
jamal




[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-23 Thread jamal
On Thu, 2007-23-08 at 15:30 -0700, David Miller wrote:
 From: jamal [EMAIL PROTECTED]
 Date: Thu, 23 Aug 2007 18:04:10 -0400
 
  Possibly a bug - but you really should turn off TSO if you are doing
  huge interactive transactions (which is fair because there is a clear
  demarcation).
 
 I don't see how this can matter.
 
 TSO only ever does anything if you accumulate more than one MSS
 worth of data.

I stand corrected then.

cheers,
jamal



[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-23 Thread jamal
On Thu, 2007-23-08 at 15:35 -0700, Rick Jones wrote:
 jamal wrote:
  [TSO already passed - iirc, it has been
  demostranted to really not add much to throughput (cant improve much
  over closeness to wire speed) but improve CPU utilization].
 
 In the one gig space sure, but in the 10 Gig space, TSO on/off does make a 
 difference for throughput.

I am still so 1Gige;- I stand corrected again ;-

cheers,
jamal




[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-21 Thread jamal
On Tue, 2007-21-08 at 00:18 -0700, David Miller wrote:

 Using 16K buffer size really isn't going to keep the pipe full enough
 for TSO. 

Why the comparison with TSO (or GSO for that matter)?
Seems to me that is only valid/fair if you have a single flow. 
Batching is multi-flow focused (or i should say flow-unaware).

cheers,
jamal




[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-21 Thread jamal
On Tue, 2007-21-08 at 11:51 -0700, David Miller wrote:

 Because TSO does batching already, so it's a very good
 tit for tat comparison of the new batching scheme
 vs. an existing one.

Fair enough - I may have read too much into your email then ;->
For bulk type apps (where TSO will make a difference) this is a fair
test. Hence I agree the 16KB buffer size is not sensible if the goal is
to simulate such an app.
However (and this is where I read too much into what you were saying),
the test by itself is an insufficient comparison. You gotta look at
the other side of the coin, i.e. at apps where TSO won't buy much.
Examples: a busy ssh or irc server; and you could go as far as looking at
the most predominant app in the wild west, http (the average page size from
a few years back was in the range of 10-20K and can be simulated with
good ole netperf/iperf).

cheers,
jamal



[ofa-general] Re: [PATCH 3/9 Rev3] [sched] Modify qdisc_run to support batching

2007-08-08 Thread jamal
On Wed, 2007-08-08 at 16:05 +0200, Patrick McHardy wrote:

 This is not completely avoidable of
 course, but you can and should limit the damage by only pulling
 out as much packets as the driver can take and have the driver
 stop the queue afterwards.

This is why dev->xmit_win exists in my patches. The driver says how
much space it has using this variable - and only that many packets get
pulled off; so there is no conflict with any scheduling algorithm, be it
work or non-work conserving.
The ideal case is never to carry over anything in dev->blist. However,
there is a challenge with GSO/TSO: if, say, the descriptors in the driver
are fewer than the list of skbs, you are faced with the dilemma of sending
a fraction of the gso list first.
My current approach is to transmit as much as the driver will allow.
On the next opportunity to transmit, you immediately send anything
remaining first before you ask the qdisc for more packets. An
alternative approach I had entertained was not to send anything from the
skb list when hitting such a case, but it hasn't seemed needed so far
(I haven't seen any leftovers in my experiments).
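
A rough sketch of that policy (a simplification for illustration, not the
actual core code; descriptor/frag accounting for GSO is elided here):

---
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <net/sch_generic.h>

/* top up dev->blist: leftovers from the previous round stay at the head
 * and therefore go out first; nothing is ever requeued to the qdisc */
static void fill_batch_sketch(struct net_device *dev, struct Qdisc *q)
{
	int room = dev->xmit_win - (int)skb_queue_len(&dev->blist);

	while (room > 0) {
		struct sk_buff *skb = q->dequeue(q);

		if (!skb)
			break;
		__skb_queue_tail(&dev->blist, skb);
		room--;
	}
}
---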

cheers,
jamal



[ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-08 Thread jamal
On Wed, 2007-08-08 at 21:42 +0800, Herbert Xu wrote:
 On Wed, Aug 08, 2007 at 03:49:00AM -0700, David Miller wrote:
  
  Not because I think it obviates your work, but rather because I'm
  curious, could you test a TSO-in-hardware driver converted to
  batching and see how TSO alone compares to batching for a pure
  TCP workload?
 
 You could even lower the bar by disabling TSO and enabling
 software GSO.

From my observation, for TCP packets slightly above the MTU (up to 2K), GSO
gives worse performance than non-GSO throughput-wise. Actually this has
nothing to do with batching; rather, the behavior is consistent with or
without the batching changes.

  I personally don't think it will help for that case at all as
  TSO likely does better job of coalescing the work _and_ reducing
  bus traffic as well as work in the TCP stack.
 
 I agree.
 I suspect the bulk of the effort is in getting
 these skb's created and processed by the stack so that by
 the time that they're exiting the qdisc there's not much
 to be saved anymore.

pktgen shows a clear win if you test the driver path - which is what you
should test, because that's where the batching changes are.
Using TCP or UDP adds other variables[1] that need to be isolated first
in order to quantify the effect of batching. 
For throughput and CPU utilization, the benefit will be clear when there
are a lot more flows. 

cheers,
jamal

[1] I think there are too many other variables in play, unfortunately,
when you are dealing with a path that starts above the driver and one
that covers the end to end effect: traffic/app source, system clock sources
as per my recent discovery, congestion control algorithms used, tuning
of the receiver, etc.



[ofa-general] Re: [PATCH 00/10] Implement batching skb API

2007-07-24 Thread jamal
KK,

On Tue, 2007-24-07 at 09:14 +0530, Krishna Kumar2 wrote:

 
 J Hadi Salim [EMAIL PROTECTED] wrote on 07/23/2007 06:02:01 PM:


 Actually you have not sent netperf results with prep and without prep.

My results were based on pktgen (which i explained as testing the
driver). I think depending on netperf without further analysis is
simplistic. It was like me doing forwarding tests on these patches.

  So _which_ non-LLTX driver doesnt do that? ;-
 
 I have no idea since I haven't looked at all drivers. Can you tell which
 all non-LLTX drivers does that ? I stated this as the sole criterea.

The few i have peeked at all do it. I also think the e1000 should be
converted to be non-LLTX. The rest of netdev is screaming to kill LLTX. 

  tun driver doesnt use it either - but i doubt that makes it bloat
 
 Adding extra code that is currently not usable (esp from a submission
 point) is bloat.

So far I have converted 3 drivers; 1 of them doesn't use it. Two more
driver conversions are on the way, and they will both use it. How is this
bloat again?
A few emails back you said that if only IPoIB can use batching then that's
good enough justification.

  You waltz in, have the luxury of looking at my code, presentations, many
  discussions with me etc ...
 
 luxury ? 
 I had implemented the entire thing even before knowing that you
 are working on something similar! and I had sent the first proposal to
 netdev,

I saw your patch at the end of may (or at least 2 weeks after you said
it existed). That patch has very little resemblance to what you just
posted conceptwise or codewise. I could post it if you would give me
permission.

 *after* which you told that you have your own code and presentations (which
 I had never seen earlier - I joined netdev a few months back, earlier I was
 working on RDMA, Infiniband as you know).

I am gonna assume you didnt know of my work - which i have been making
public for about 3 years. In fact i talked about this topic when i
visited your office in 2006 on a day you were not present, so it is
plausible you didnt hear of it.

  And it didn't give me any great
 ideas either, remember I had posted results for E1000 at the time of
 sending the proposals. 

In mid-June you sent me a series of patches which included anything from
changing variable names to combining qdisc_restart and about everything
i referred to as being cosmetic differences in your posted patches. I
took two of those and incorporated them in. One was an XXX in my code
already to allocate the dev->blist 
(Commit: bb4464c5f67e2a69ffb233fcf07aede8657e4f63). 
The other one was a mechanical removal of the blist being passed
(Commit: 0e9959e5ee6f6d46747c97ca8edc91b3eefa0757). 
Some of the others i asked you to defer. For example, the reason i gave
you for not merging any qdisc_restart_combine changes is because i was
waiting for Dave to swallow the qdisc_restart changes i made; otherwise
maintenance becomes extremely painful for me. 
Sridhar actually provided a lot more valuable comments and fixes but has
not planted a flag on behalf of the Queen of Spain like you did. 

 However I do give credit in my proposal to you for the
 ideas that you provided (without actual code), and the same I did for other
 people who did the same, like Dave, Sridhar. BTW, you too had discussions 
 with me,
 and I sent some patches to improve your code too, 

I incorporated two of your patches and asked for deferral of the others.
These patches have now shown up in what you claim as the difference. I
just call them cosmetic differences not to downplay the importance of
having an ethtool interface but because they do not make batching
perform any better. The real differences are those two items. I am
surprised you havent cannibalized those changes as well. I thought you
renamed them to something else; according to your posting:
"This patch will work with drivers updated by Jamal, Matt & Michael Chan
with minor modifications - rename xmit_win to xmit_slots & rename batch
handler." Or maybe thats a future plan you have in mind?

 so it looks like a two
 way street to me (and that is how open source works and should).

Open source is a lot more transparent than that.

You posted a question, which was part of your research. I responded and
told you i have patches; you asked me for them and i promptly ported
them from pre-2.6.18 to the latest kernel at the time. 

The nature of this batching work is one of performance. So numbers are
important. If you had some strong disagreements on something in the
architecture, then it would be of great value to explain it in
technical detail - and more importantly to provide some numbers to say
why it is a bad idea. You get numbers by running some tests. 
You did none of the above. Your effort has been to produce your patch
for whatever reasons. This would not have been problematic to me if it
had actually been motivated by optimization, because the end goal
would have been achieved.

I have deleted the rest of the email

[ofa-general] Re: [PATCH 00/10] Implement batching skb API

2007-07-23 Thread jamal
KK,

On Mon, 2007-23-07 at 10:19 +0530, Krishna Kumar2 wrote:

 Hmmm ? Evgeniy has not even tested my code to find some regression :) And
 you may possibly not find much improvement in E1000 when you run iperf
 (which is what I do) compared to pktgen. 

Pktgen is the correct test (or the closest to correct) because it tests
the driver tx path. iperf/netperf test the effect of batching on
tcp/udp. In fact i would start with udp first. What you need to do if
testing end-to-end is see where the effects occur. For example, it is
feasible that batching is a little too aggressive and the receiver cant
keep up (netstat -s before and after will be helpful).
Maybe by such insight we can improve things.

  My experiments show it is useful (in a very visible way using pktgen)
  for e1000 to have the prep() interface.
 
 I meant : have you compared results of batching with prep on vs prep off,
 and
 what is the difference in BW ?

Yes, and these results were sent to you as well a while back.
When i get back and have the time, i will look them up on my test machine
and resend.

 No. I see value only in non-LLTX drivers which also gets the same TX lock
 in the RX path.

So _which_ non-LLTX driver doesnt do that? ;-
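
(For the record, the pattern being referred to looks roughly like the
following in a typical non-LLTX driver - a schematic sketch only, not any
specific driver; example_priv, tx_done(), free_tx_desc() and ring_has_room()
are made-up stand-ins for a driver's own ring bookkeeping:)

static void example_tx_complete(struct net_device *dev,
                                struct example_priv *tp)
{
        /* Non-LLTX: the core serializes hard_start_xmit() with
         * netif_tx_lock(dev); the completion path, typically run from
         * the RX interrupt/NAPI poll, contends on that same lock to
         * reclaim descriptors and wake the queue. */
        netif_tx_lock(dev);
        while (tp->dirty_tx != tp->cur_tx && tx_done(tp, tp->dirty_tx)) {
                free_tx_desc(tp, tp->dirty_tx);
                tp->dirty_tx++;
        }
        if (netif_queue_stopped(dev) && ring_has_room(tp))
                netif_wake_queue(dev);
        netif_tx_unlock(dev);
}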

  The value is also there in LLTX drivers even if in just formatting a skb
  ready for transmit. If this is not clear i could do a much longer
  writeup on my thought evolution towards adding prep().
 
 In LLTX drivers, the driver does the 'prep' without holding the tx_lock in
 any case, so there should be no improvement. Could you send the write-up

I will - please give me some time; i am overloaded at the moment.

 There is *nothing* IPoIB specific or focused in my code. 
 I said adding prep doesn't work for IPoIB and so it is pointless to add
 bloat to the code until some code can

tun driver doesnt use it either - but i doubt that makes it bloat

  What I meant to say
 is that there isn't much point in saying that your code is not ready or
 you are using old code base, or has multiple restart functions, or is not
 tested enough, etc, and then say let's re-do/rethink the whole
 implementation when my code is already working and giving good results.

The suggestive hand gesturing is the kind of thing that bothers me. What
do you think: Would i be submitting patches based on 2.6.22-rc4? Would
it make sense to include parallel qdisc paths? For heaven's sake, i have
told you i would be fine with accepting such changes when the qdisc
restart changes went in first.
You waltz in, have the luxury of looking at my code, presentations, many
discussions with me etc ...
When i ask for differences to code you produced, they now seem to sum up
to the two below. You dont think theres some honest issue with this
picture?

 OTOH, if you find some cases that are better handled with :
   1. prep handler
   2. xmit_win (which I don't have now),
 then please send me patches and I will also test out and incorporate.
 

And then of course you will end up adding those because they are both
useful, just calling them some other name. And then you will end up
incorporating all the drivers i invested many hours (as a gratuitous
volunteer) to change and test - maybe you will change variable names or
rearrange some function. 
I am a very compromising person; i have no problem coauthoring these
patches if you actually invest useful time like fixing things up and
doing proper tests. But you are not doing that - instead you are being
extremely aggressive and hijacking the whole thing. It is courteous, if
you find somebody else has a patch, to point out whats wrong, preferably
with some proof. 

  It sounds disingenuous but i may have misread you.
 
 (lacking in frankness, candor, or sincerity; falsely or hypocritically
 ingenuous; insincere)  Sorry, no response to personal comments and
 have a flame-war :)

Give me a better description. 

cheers,
jamal



[ofa-general] Re: [PATCH 00/10] Implement batching skb API

2007-07-22 Thread jamal
KK,

On Sun, 2007-22-07 at 11:57 +0530, Krishna Kumar2 wrote:

 Batching need not be useful for every hardware. 

My concern is there is no consistency in results. I see improvements in
something which you say you dont. You see improvement in something that
Evgeniy doesnt, etc. 
There are many knobs and, at a minimum, we need to find out how those
play.

 I don't quite agree with that approach, eg, if the blist is empty and the
 driver says there is space for one packet, you will add one packet and
 the driver sends it out and the device is stopped (with potentially a lot of
 skbs on dev->q). Then no packets are added till the queue is enabled, at
 which time a flood of skbs will be processed, increasing latency and holding
 the lock for a single longer duration. My approach will mitigate holding the
 lock for longer times and instead send skbs to the device as long as we are
 within the limits.

Just as a side note _I do have this feature_ in the pktgen piece.
In fact, you can tell pktgen what that bound is as opposed to hard
coding it (look at the pktgen batchl parameter). I have not found it to be
useful experimentally; actually, i should say i could not zero in on a
useful value by experimenting and it was better to turn it off.
I never tried adding it to the qdisc path - but this is something i
could try and, as i said, it may prove useful.
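
To sketch what that would look like in the qdisc path (illustrative only,
using the dev->blist and dev->xmit_win names from my patches and an optional
hard cap in the spirit of pktgen's batchl; fill_batch_list is a made-up name
and this is not the posted code):

static int fill_batch_list(struct net_device *dev, struct Qdisc *q,
                           unsigned int cap)
{
        /* Ask the driver how much room it has (dev->xmit_win), and
         * optionally clamp that with a hard upper bound. */
        unsigned int room = dev->xmit_win;
        struct sk_buff *skb;
        int count = 0;

        if (cap && room > cap)
                room = cap;

        while (room-- && !netif_queue_stopped(dev)) {
                skb = q->dequeue(q);
                if (!skb)
                        break;
                /* dev->blist (a struct sk_buff_head in the patches) is
                 * what gets handed to the driver in one shot. */
                __skb_queue_tail(dev->blist, skb);
                count++;
        }
        return count;   /* how many skbs the driver will see this round */
}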

 Since E1000 doesn't seem to use the TX lock on RX (at least I couldn't
 find it), I feel having prep will not help as no other cpu can execute
 the queue/xmit code anyway (E1000 is also an LLTX driver).

My experiments show it is useful (in a very visible way using pktgen)
for e1000 to have the prep() interface.

  Other drivers that hold the tx lock could get an improvement however.

So you do see the value then with non-LLTX drivers, right? ;- 
The value is also there in LLTX drivers even if in just formatting a skb
ready for transmit. If this is not clear i could do a much longer
writeup on my thought evolution towards adding prep().

 I wonder if you tried enabling/disabling 'prep' on E1000 to see how the
 performance is affected. 

Absolutely. And regardless of whether its beneficial or not for e1000,
theres clear benefit in the tg3 for example.

 If it helps, I guess you could send me a patch to
 add that and I can also test it to see what the effect is. I didn't add it
 since IPoIB wouldn't be able to exploit it (unless someone is kind enough
 to show me how to).

Such core code should not just be focused on IPoIB.

 I think the code I have is ready and stable, 

I am not sure how to interpret that - are you saying all-is-good and we
should just push your code in? It sounds disingenuous but i may have
misread you. 

cheers,
jamal



[ofa-general] Re: [PATCH 00/10] Implement batching skb API

2007-07-21 Thread jamal

I am (have been) under extreme travel mode - so i will have high latency
in follow ups.

On Fri, 2007-20-07 at 12:01 +0530, Krishna Kumar wrote:
 Hi Dave, Roland, everyone,
 
 In May, I had proposed creating an API for sending 'n' skbs to a driver to
 reduce lock overhead, DMA operations, and specific to drivers that have
 completion notification like IPoIB - reduce completion handling ([RFC] New
 driver API to speed up small packets xmits @
 http://marc.info/?l=linux-netdev&m=117880900818960&w=2). I had also sent
 initial test results for E1000 which showed minor improvements (but also
 got degradations) @ http://marc.info/?l=linux-netdev&m=117887698405795&w=2.
 

Add to that context that i have been putting out patches on this over
the last 3+ years, as well as several public presentations - the last one
being: http://vger.kernel.org/jamal_netconf2006.sxi

My main problem (and obstacle to submitting the patches) has been a
result of not doing the appropriate testing - i had been testing the
forwarding path (in all my results post the latest patches) when i
should really have been testing the improvement of the tx path. 

 There is a parallel WIP by Jamal but the two implementations are completely
 different since the code bases from the start were separate. Key changes:
   - Use a single qdisc interface to avoid code duplication and reduce
 the maintenance burden (sch_generic.c size reduces by ~9%).
   - Has per device configurable parameter to turn on/off batching.
   - qdisc_restart gets slightly modified while looking simple without
 any checks for batching vs regular code (infact only two lines have
 changed - 1. instead of dev_dequeue_skb, a new batch-aware function
 is called; and 2. an extra call to hard_start_xmit_batch).

   - No change in __qdisc_run other than a new argument (from DM's idea).
   - Applies to latest net-2.6.23 compared to 2.6.22-rc4 code.

All the above are cosmetic differences. To me the highest priority
is making sure that batching is useful and what the limitations are.
At some point, when all looks good - i dont mind adding an ethtool
interface to turn batching on/off, merging with the new qdisc restart path
instead of having a parallel path, soliciting feedback on naming, where to
allocate structs etc etc. All that is low prio if batching across a
variety of hardware and applications doesnt prove useful. At the moment,
i am unsure theres enough consistency to justify pushing batching in.

Having said that, below are the main architectural differences we have,
which is what we really need to discuss and see what proves useful:

 - Batching algo/processing is different (eg. if qdisc_restart() finds
 one skb in the batch list, it will try to batch more (up to a limit)
 instead of sending that out and batching the rest in the next call).

This sounds a little more aggressive but may be useful.
I have experimented with setting upper bound limits (current patches
have a pktgen interface to set the max to send) and have concluded that
it is unneeded. Probing by letting the driver tell you what space is
available has proven to be the best approach. I have been meaning to
remove the code in pktgen which allows these limits.
 
   - Jamal's code has a separate hw prep handler called from the stack,
 and results are accessed in driver during xmit later.

I have explained the reasoning to this a few times. A recent response to
Michael Chan is here:
http://marc.info/?l=linux-netdev&m=118346921316657&w=2
And heres a response to you that i havent heard back on:
http://marc.info/?l=linux-netdev&m=118355539503924&w=2

My tests so far indicate this interface is useful. It doesnt apply well
to some drivers (for example i dont use it in tun) - which makes it
optional but useful nevertheless. I will be more than happy to kill this
if i can find cases where it proves to be a bad idea.
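
To make the prep idea concrete, the shape is roughly as below - a sketch
only, not the posted patch; hard_prep_xmit is an illustrative name for the
prep handler and core_prep_and_batch is made up for this example:

/* Core side, called without the tx lock held: let the driver do the
 * expensive per-skb work up front (header formatting, checksum fixups,
 * stashing descriptor info in skb->cb), so that hard_start_xmit later
 * only has to post descriptors while holding the lock. */
static int core_prep_and_batch(struct sk_buff *skb, struct net_device *dev,
                               struct sk_buff_head *blist)
{
        if (dev->hard_prep_xmit) {
                int ret = dev->hard_prep_xmit(skb, dev);

                if (ret)        /* driver could not pre-format this skb */
                        return ret;
        }
        __skb_queue_tail(blist, skb);
        return 0;
}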

   - Jamal's code has dev->xmit_win which is cached by the driver. Mine
 has dev->xmit_slots but this is used only by the driver while the
 core has a different mechanism to find how many skbs to batch.

This is related to the first item.
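
The driver side of that caching is simple enough - again a sketch, with
example_priv and tx_free standing in for whatever a real driver keeps for
its TX ring state:

/* Called from the driver's TX-completion path (and after queueing new
 * skbs): keep dev->xmit_win in step with the free TX descriptors so the
 * core never batches more than the ring can actually take. */
static void example_update_xmit_win(struct example_priv *tp,
                                    struct net_device *dev)
{
        unsigned int free = tp->tx_free;        /* free descriptors left */

        if (free < MAX_SKB_FRAGS + 1) {
                netif_stop_queue(dev);
                dev->xmit_win = 1;              /* nothing worth batching */
        } else {
                dev->xmit_win = free;           /* advertise the window */
                if (netif_queue_stopped(dev))
                        netif_wake_queue(dev);
        }
}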

   - Completely different structure/design & coding styles.
 (This patch will work with drivers updated by Jamal, Matt & Michael Chan with
 minor modifications - rename xmit_win to xmit_slots & rename batch handler)

Again, cosmetics (and an indication you are morphing towards me).

So if i was to sum this up (it would be a useful discussion to have on
these), the real differences are:

a) you have an extra check on refilling the skb list when you find that
it has a single skb. I tagged this as being potentially useful.
b) You have a check for some upper bound on the number of skbs to send
to the driver. I tagged this as unnecessary - the interface is still on
in my current code, so it shouldnt be hard to show one way or other.
c) You dont have prep_xmit()

Add to that list any other architectural differences

[ofa-general] TCP and batching WAS (Re: [PATCH 00/10] Implement batching skb API)

2007-07-21 Thread jamal
On Fri, 2007-20-07 at 08:18 +0100, Stephen Hemminger wrote:

 You may see worse performance with batching in the real world when
 running over WAN's.  Like TSO, batching will generate back to back packet
 trains that are subject to multi-packet synchronized loss. 

Has someone done any study on the TSO effect? Doesnt ECN with a RED router
help with something like this?
I find it surprising that a single flow doing TSO would overwhelm a
router's buffer. I actually think the value of batching as far as TCP is
concerned is proportional to the number of flows, i.e. the more flows you
have, the more batching you will end up doing. And if TCP's fairness is
the legend it has been made out to be, then i dont see this as
problematic.

BTW, something i noticed with regard to GSO when testing batching:
For TCP packets slightly above the MTU (up to 2K), GSO gives worse
performance than non-GSO. It actually has nothing to do with batching;
it works the same way with or without the batching changes.

Another oddity:
Looking at the flow rate from a purely packets-per-second view (I know thats
a router-centric view, but i found it strange nevertheless) - you see that
as packet size goes up, the pps also goes up. I tried mucking around
with Nagle etc, but saw no observable changes. Any insight?
My expectation was that the pps would stay at least the same or get
better with smaller packets (assuming theres less data to push around).

cheers,
jamal


