Mugunthan and all,
We have essentially proven our theory, but have yet to solve the problem. Our
theory was that the driver was essentially taking a few second nap during
transmission, despite there being plenty more work to do. We further theorized
that our custom userspace application was somehow preventing the driver from
taking this nap, and that's why our RX performance is perfect all the time
while our TX is so spotty unless that application is running.
Over the last few days, we've went through several test patches and have
essentially tracked the issue back to our emac_poll() function not being
called. By inserting the following patch snippet, and snippets like it, into
various functions in the emac driver, we were able to prove that from time to
time these functions sleep for several seconds, despite iperf certainly having
given them something to do. (Note: the patch has changed a bit since the
output pasted below was captured, as we were looking for more detail, but the
core functionality is the same.)
===================================================================================================
[[ ---- patch snippet ---- ]]
===================================================================================================
--- linux/drivers/net/ethernet/ti/davinci_emac.c 2013-04-08
16:20:19.202550042 -0400
+++ linux_emac_tst/drivers/net/ethernet/ti/davinci_emac.c 2013-04-08
18:45:37.000000000 -0400
@@ -64,6 +64,12 @@
#include "davinci_cpdma.h"
+//// DEBUGGING ONLY!!!!
+#include <linux/jiffies.h>
+static unsigned long qstopped_jifs = 0;
+static unsigned long poll_func_jifs = 0;
+static u32 last_num_rx_pkts = 0, last_num_tx_pkts = 0;
+
static int debug_level;
module_param(debug_level, int, 0);
MODULE_PARM_DESC(debug_level, "DaVinci EMAC debug level (NETIF_MSG bits)");
@@ -1055,7 +1061,19 @@
* queue is stopped then start the queue as we have free desc for tx
*/
if (unlikely(netif_queue_stopped(ndev)))
+ {
netif_wake_queue(ndev);
+ if (qstopped_jifs > 0)
+ {
+ // we've been stopped, measure how long
+ if ( time_after_eq(jiffies, (qstopped_jifs + HZ/2)) )
+ {
+ // we've been stopped >= 1/2 second!!!
+ printk(KERN_ERR "===EMAC: TX queue has been stopped for > 1/2
second!!!\n");
+ }
+ }
+ qstopped_jifs = 0;
+ }
ndev->stats.tx_packets++;
ndev->stats.tx_bytes += len;
dev_kfree_skb_any(skb);
@@ -1105,7 +1123,11 @@
* tell the kernel to stop sending us tx frames.
*/
if (unlikely(!cpdma_check_free_tx_desc(priv->txchan)))
+ {
netif_stop_queue(ndev);
+ // we just stopped, record the time
+ qstopped_jifs = jiffies;
+ }
return NETDEV_TX_OK;
@@ -1374,6 +1396,15 @@
u32 status = 0;
u32 num_tx_pkts = 0, num_rx_pkts = 0;
+
+unsigned long tmp = jiffies;
+if ( time_after_eq(tmp, (poll_func_jifs + HZ/2)) )
+{
+ // we've been stopped >= 1/2 second!!!
+ printk(KERN_ERR "===EMAC: POLL FUNCTION, last RX/TX=[%d][%d]\n",
last_num_rx_pkts, last_num_tx_pkts);
+}
+poll_func_jifs = tmp;
+
/* Check interrupt vectors and call packet processing */
status = emac_read(EMAC_MACINVECTOR);
@@ -1385,6 +1416,10 @@
if (status & mask) {
num_tx_pkts = cpdma_chan_process(priv->txchan,
EMAC_DEF_TX_MAX_SERVICE);
+ if (num_tx_pkts < 0)
+ {
+printk(KERN_ERR "=== ERR!!! num_tx_pkts=%d\n", num_tx_pkts);
+ }
} /* TX processing */
mask = EMAC_DM644X_MAC_IN_VECTOR_RX_INT_VEC;
@@ -1394,8 +1429,17 @@
if (status & mask) {
num_rx_pkts = cpdma_chan_process(priv->rxchan, budget);
+ if (num_rx_pkts < 0)
+ {
+printk(KERN_ERR "=== ERR!!! num_rx_pkts=%d\n", num_rx_pkts);
+ }
} /* RX processing */
+
+last_num_tx_pkts = num_tx_pkts;
+last_num_rx_pkts = num_rx_pkts;
+
+
mask = EMAC_DM644X_MAC_IN_VECTOR_HOST_INT;
if (priv->version == EMAC_VERSION_2)
mask = EMAC_DM646X_MAC_IN_VECTOR_HOST_INT;
===================================================================================================
[[ ---- patch snippet ---- ]]
===================================================================================================
It appears from our debugging that somehow the emac_poll() function is not
being called despite there being work to do. As before however, when we start
our custom application (which presently does very little from a system calls'
perspective), our performance stabilizes greatly. Please see below.
Has anyone else seen anything like this with the EMAC, or have any idea why or
how this could be happening? We initially thought it could just be a failure
to properly reschedule the emac_wait(), but after playing with various items
there for a while, we're convinced that's not it. Now we're guessing that
perhaps softirqs or something within the DMA world is to blame here, but we're
certainly not experts in this area. Again, we've passed in 'nohlt' and have
worked to remove power management. So we don't think that plays a role.
Likewise, we never see a packet error -- even in our Wireshark traces. We just
see the system take a split-second to multi-second break.
What specifically causes the emac_poll() and emac_tx_handler() functions to
fire, and is there a way to debug that cleanly?
Can anyone imagine how a userspace application would be able to improve this
behavior so much?
Again, thank you so very much for your time.
CF
Boot log:
===================================================================================================
Fresh boot with no network activity, we see this with our debugging in there.
===================================================================================================
[ 22.815582] ===EMAC: POLL FUNCTION has not been called for > 1/2 second!!!
[ 22.822937] ===EMAC: TX FUNCTION has not been called for > 1/2 second!!!
[ 23.817993] ===EMAC: POLL FUNCTION has not been called for > 1/2 second!!!
[ 23.825378] ===EMAC: TX FUNCTION has not been called for > 1/2 second!!!
[ 24.817810] ===EMAC: POLL FUNCTION has not been called for > 1/2 second!!!
[ 24.825225] ===EMAC: TX FUNCTION has not been called for > 1/2 second!!!
[ 30.793365] ===EMAC: POLL FUNCTION has not been called for > 1/2 second!!!
[ 30.800811] ===EMAC: TX FUNCTION has not been called for > 1/2 second!!!
[ 31.427093] ===EMAC: POLL FUNCTION has not been called for > 1/2 second!!!
[ 31.434661] ===EMAC: TX FUNCTION has not been called for > 1/2 second!!!
[ 32.426879] ===EMAC: POLL FUNCTION has not been called for > 1/2 second!!!
[ 32.434387] ===EMAC: TX FUNCTION has not been called for > 1/2 second!!!
root@test:~# iperf-i 5 -t 30000 -c 10.22.255.254
[ 43.093078] ===EMAC: POLL FUNCTION has not been called for > 1/2 second!!!
[ 43.100463] ===EMAC: TX FUNCTION has not been called for > 1/2 second!!!
===================================================================================================
Starting iperf with 4x instances of <Custom App> stuff running. See? It's
very happy for a long time.
===================================================================================================
------------------------------------------------------------
Client connecting to 10.22.255.254, TCP port 5001
TCP window size: 19.6 KByte (default)
------------------------------------------------------------
[ 3] local 10.22.0.2 port 44102 connected with 10.22.255.254 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 5.0 sec 56.8 MBytes 95.3 Mbits/sec
[ 3] 5.0-10.0 sec 56.0 MBytes 94.0 Mbits/sec
[[ ---- SNIP ---- ]]
[ 3] 615.0-620.0 sec 56.4 MBytes 94.6 Mbits/sec
[ 3] 620.0-625.0 sec 55.3 MBytes 92.9 Mbits/sec
===================================================================================================
Stopping all 4 <Custom App> instances from running...
===================================================================================================
Stopping <Custom App>1...
[ OK ] Stopped <Custom App>1.
Stopping <Custom App>2...
[ OK ] Stopped <Custom App>2.
Stopping <Custom App>3...
[ OK ] Stopped <Custom App>3.
[ 3] 625.0-630.0 sec 55.1 MBytes 92.4 Mbits/sec
Stopping <Custom App>4...
[ OK ] Stopped <Custom App>4.
===================================================================================================
Instantaneously, we see the POLL (and therefore the TX handler) warnings
getting called
and the performance is slipping and/or timing out all together.
===================================================================================================
[ 681.469207] ===EMAC: POLL FUNCTION has not been called for > 1/2 second!!!
[ 681.476501] ===EMAC: TX FUNCTION has not been called for > 1/2 second!!!
[ 3] 630.0-635.0 sec 8.59 MBytes 14.4 Mbits/sec
[ 692.489929] ===EMAC: POLL FUNCTION has not been called for > 1/2 second!!!
[ 692.497222] ===EMAC: TX FUNCTION has not been called for > 1/2 second!!!
[ 692.504272] ===EMAC: TX queue has been stopped for > 1/2 second!!!
[ 3] 635.0-640.0 sec 11.4 MBytes 19.1 Mbits/sec
[ 3] 640.0-645.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 645.0-650.0 sec 5.71 MBytes 9.58 Mbits/sec
[ 3] 650.0-655.0 sec 29.6 MBytes 49.6 Mbits/sec
[ 3] 655.0-660.0 sec 46.7 MBytes 78.3 Mbits/sec
[ 714.523315] ===EMAC: POLL FUNCTION has not been called for > 1/2 second!!!
[ 714.530609] ===EMAC: TX FUNCTION has not been called for > 1/2 second!!!
[ 3] 660.0-665.0 sec 42.3 MBytes 71.0 Mbits/sec
[ 3] 665.0-670.0 sec 0.00 Bytes 0.00 bits/sec
[ 3] 670.0-675.0 sec 37.5 MBytes 62.9 Mbits/sec
[ 3] 675.0-680.0 sec 53.1 MBytes 89.1 Mbits/sec
[ 3] 680.0-685.0 sec 44.8 MBytes 75.2 Mbits/sec
[ 3] 685.0-690.0 sec 49.2 MBytes 82.6 Mbits/sec
[ 3] 690.0-695.0 sec 49.0 MBytes 82.2 Mbits/sec
[ 747.616729] ===EMAC: POLL FUNCTION has not been called for > 1/2 second!!!
[ 747.624023] ===EMAC: TX FUNCTION has not been called for > 1/2 second!!!
[ 3] 695.0-700.0 sec 40.2 MBytes 67.4 Mbits/sec
[ 3] 700.0-705.0 sec 5.38 MBytes 9.03 Mbits/sec
[ 3] 705.0-710.0 sec 51.8 MBytes 86.9 Mbits/sec
Starting <Custom App>2...
[ OK ] Started <Custom App>2.
[ 3] 710.0-715.0 sec 51.9 MBytes 87.1 Mbits/sec
[ 3] 715.0-720.0 sec 54.1 MBytes 90.8 Mbits/sec
[ 3] 720.0-725.0 sec 53.6 MBytes 89.9 Mbits/sec
[ 3] 725.0-730.0 sec 53.3 MBytes 89.4 Mbits/sec
[ 3] 730.0-735.0 sec 53.2 MBytes 89.3 Mbits/sec
[ 3] 735.0-740.0 sec 56.1 MBytes 94.1 Mbits/sec
[ 3] 740.0-745.0 sec 52.0 MBytes 87.3 Mbits/sec
[ 3] 745.0-750.0 sec 52.5 MBytes 88.1 Mbits/sec
[ 3] 750.0-755.0 sec 55.9 MBytes 93.8 Mbits/sec
[ 3] 755.0-760.0 sec 55.9 MBytes 93.9 Mbits/sec
[ 3] 760.0-765.0 sec 53.3 MBytes 89.5 Mbits/sec
^C[ 3] 0.0-765.9 sec 7.93 GBytes 88.9 Mbits/sec
===================================================================================================
Shut down iperf and drop back to the default idling
===================================================================================================
root@test:~# [ 809.822906] ===EMAC: POLL FUNCTION has not been called for >
1/2 second!!!
[ 809.960937] ===EMAC: TX FUNCTION has not been called for > 1/2 second!!!
[ 810.822845] ===EMAC: POLL FUNCTION has not been called for > 1/2 second!!!
[ 810.960937] ===EMAC: TX FUNCTION has not been called for > 1/2 second!!!
[ 813.143280] ===EMAC: POLL FUNCTION has not been called for > 1/2 second!!!
[ 813.355712] ===EMAC: TX FUNCTION has not been called for > 1/2 second!!!
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html