(( Attempting to re-post this since Yahoo! shipped the previous one as HTML... 
))

All,

My team is presently seeing *extremely poor* (on the order of single-digit 
Mbps) Ethernet performance out of an AM3517-based COM (Technexion's TAM-3517 in 
this case) when it _transmits TCP_. Receiving TCP appears to work fine, and our 
UDP transmit and receive both appear pretty solid. Is anyone else seeing 
anything like this on an AM3517-based platform? (I have a CompuLab CM-T3517 
that I'll try to get to by the end of this week for comparison.)

I reported a similar, perhaps related, issue nearly a year ago at 
http://thread.gmane.org/gmane.linux.ports.arm.omap/78647 & 
http://e2e.ti.com/support/arm/sitara_arm/f/416/t/195442.aspx, and never heard 
much in response. Though the performance of the EMAC port has never been 
stellar (others have admitted that), we've continued working with the COM 
because the network performance our tests were seeing at the time was more than 
adequate for the tasks at hand. Recently, however, while testing our latest 
hardware we hit this nasty performance snag, which caused us to revisit the 
issue entirely. Frustratingly, these tests show that performance is now far 
worse than anything we previously saw, on both our custom hardware and the 
dev-kit systems.

The behavior is easily characterized using 'iperf'. If the TAM hosts the iperf 
server (i.e. receives TCP using 'iperf -s'), a client can connect to it and run 
at ~90 Mbps forever. That's perfect. If those roles are reversed, however, and 
the TAM plays client (i.e. transmits TCP using 'iperf -i 10 -t 60 -c 
<server_ip>'), the data rate becomes sporadic and often plummets or even times 
out. Please see the captures below. Although it misbehaves dramatically, the 
driver never registers a single error or xrun - nothing...
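
For anyone wanting to compare against their own AM3517 hardware, the counters 
I'm referring to can be checked with something like the following (eth0 is a 
placeholder for the local interface name):

# cat /proc/net/dev          (per-interface RX/TX error and drop counters)
# ifconfig eth0              (errors, dropped, overruns, carrier)
# ethtool -S eth0            (driver statistics, if the EMAC driver exposes any)
# netstat -s | grep -i retrans     (TCP retransmit counters on the sender)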

*** EMAC running server (receiving TCP) ***
$ iperf -i 10 -t 60 -c 10.22.0.17
------------------------------------------------------------
Client connecting to 10.22.0.17, TCP port 5001
TCP window size: 23.5 KByte (default)
------------------------------------------------------------
[ 3] local 10.22.255.5 port 60936 connected with 10.22.0.17 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 113 MBytes 94.5 Mbits/sec
[ 3] 10.0-20.0 sec 112 MBytes 94.3 Mbits/sec
[ 3] 20.0-30.0 sec 112 MBytes 94.0 Mbits/sec
[ 3] 30.0-40.0 sec 112 MBytes 94.2 Mbits/sec
[ 3] 40.0-50.0 sec 112 MBytes 94.3 Mbits/sec
[ 3] 50.0-60.0 sec 112 MBytes 94.0 Mbits/sec
[ 3] 0.0-60.0 sec 674 MBytes 94.2 Mbits/sec

*** EMAC running client (transmitting TCP) ***
# iperf -i 10 -t 60 -c 10.22.255.5
------------------------------------------------------------
Client connecting to 10.22.255.5, TCP port 5001
TCP window size: 19.6 KByte (default)
------------------------------------------------------------
[ 3] local 10.22.0.17 port 43185 connected with 10.22.255.5 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 92.5 MBytes 77.6 Mbits/sec
[ 3] 10.0-20.0 sec 102 MBytes 85.3 Mbits/sec
[ 3] 20.0-30.0 sec 98.6 MBytes 82.7 Mbits/sec
[ 3] 30.0-40.0 sec 55.4 MBytes 46.5 Mbits/sec
[ 3] 40.0-50.0 sec 2.73 MBytes 2.29 Mbits/sec
[ 3] 50.0-60.0 sec 1.26 MBytes 1.06 Mbits/sec
[ 3] 0.0-64.5 sec 352 MBytes 45.8 Mbits/sec

Since discovering this behavior at the end of last week, I have systematically 
gone back through the generations of our custom carrier boards as well as the 
TAM's Twister dev kit and confirmed that the issue is now present on everything 
we have. Since the behavior appears to have changed since we last aggressively 
tested this nearly a year ago, I'm assuming a slight software alteration 
somewhere is largely to blame. So I walked back through all of my recorded boot 
logs and retried our main previous kernels (l-o 3.4-rc6 and l-o 3.5-rc4) as 
well as older versions of the bootloaders. In every case, the problem has 
remained.

The latest software we're running is still based on linux-omap's 3.5-rc4. We 
locked the kernel down there several months ago in order to stage for release, 
and until we discovered this last week it had been running _very_ stably. I 
have, however, continued to monitor the lists and major patch sites for any 
significant bug fixes to the drivers we're using. Since discovering this issue, 
I've also backported many of the patches released by the folks I CC'd on this 
message - at least those I could easily pull in without upgrading the kernel. 
Unless I'm overlooking something, it now looks like I have everything but the 
DT and OF work merged into our kernel. (I'm assuming the DT and OF changes 
really do not impact performance. Is that a safe assumption?) Unfortunately, 
pulling in those changes has not corrected this issue.

We've done network captures on our link, and the problem is very strange. The 
iperf client transmits data quickly and steadily for a while, but then all of a 
sudden just stops. In the captures you can see an ACK come back from the server 
for the frame that was just sent, but then, instead of immediately sending the 
next one, the client just sits there, sometimes for several seconds. Then it 
suddenly picks back up and starts running again. It's as though it simply 
paused for lack of data. Again, no errors or xruns are ever triggered, and even 
with full NETIF_MSG debugging on, we're getting nothing.
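
In case it helps anyone reproduce this, the capture and debug setup is nothing 
exotic - roughly along these lines, with eth0 and the addresses from the runs 
above standing in for whatever applies locally:

(run on either end of the link while the iperf client is transmitting)
# tcpdump -i eth0 -s 96 -w /tmp/emac_tx.pcap host 10.22.255.5 and tcp port 5001

(turn the driver's message level all the way up; whether the EMAC driver acts 
on every NETIF_MSG flag is another question)
# ethtool -s eth0 msglvl 0xffff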

One other note: The more I play around with this, the more I notice that 
manually increasing the TCP window size helps things dramatically.

*** EMAC running client (transmitting TCP) WITH larger window ***
# iperf -i 10 -t 60 -c 10.22.255.5 -w 85K
------------------------------------------------------------
Client connecting to 10.22.255.5, TCP port 5001
TCP window size: 170 KByte (WARNING: requested 85.0 KByte)
------------------------------------------------------------
[ 3] local 10.22.0.17 port 43189 connected with 10.22.255.5 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 105 MBytes 88.3 Mbits/sec
[ 3] 10.0-20.0 sec 92.0 MBytes 77.2 Mbits/sec
[ 3] 20.0-30.0 sec 94.8 MBytes 79.6 Mbits/sec
[ 3] 30.0-40.0 sec 88.3 MBytes 74.1 Mbits/sec
[ 3] 40.0-50.0 sec 95.8 MBytes 80.3 Mbits/sec
[ 3] 50.0-60.0 sec 105 MBytes 87.9 Mbits/sec
[ 3] 0.0-60.0 sec 581 MBytes 81.2 Mbits/sec

While this result is encouraging, I know I should not have to do it; it feels 
like it's simply masking the real problem.
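
If we were ultimately forced to live with the workaround, the system-wide 
equivalent would presumably be raising the ipv4 send-buffer autotuning limits 
rather than passing -w to every application - something along these lines, with 
purely illustrative numbers:

# sysctl -w net.core.wmem_max=262144
# sysctl -w net.ipv4.tcp_wmem="4096 65536 262144"

But again, that would just paper over whatever is actually going wrong.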

I've looked closely at the sysctl parameters associated with the ipv4 stack 
(we're not using ipv6) and have contrasted them against the parameters on 
several systems around here. Again, I'm not finding anything obvious.
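
For reference, the comparison amounts to a dump-and-diff along these lines on 
the TAM and on each reference machine (filenames here are arbitrary):

# sysctl -a 2>/dev/null | grep '^net\.ipv4\.tcp' | sort > /tmp/tam-tcp.txt
$ diff /tmp/tam-tcp.txt /tmp/refhost-tcp.txt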

Has anyone seen anything like this out of TI's DaVinci EMAC before and/or does 
anyone have any idea what could be causing this? Any and all help in tracking 
this down would be greatly appreciated! To anyone willing to help, I'll happily 
provide as much info as I can. Please just ask.

If necessary, I can look into pushing our kernel forward toward the leading 
edge of the l-o series. However, since doing so would void the months of 
testing we've already done, I would prefer that other avenues be exhausted 
first.

Thanks in advance and thanks to all who contribute to these excellent open 
source tools and products.