I believe there is a problem with the software LRO in the ixgbe driver.  With
LRO enabled, my cluster application hangs: two processes each have data queued
to send to the other (netstat(8) shows a non-empty send queue on both sides),
but the connection makes no progress even though the receive queues are empty.
If I build the driver without LRO (make CFLAGS_EXTRA="-DIXGBE_NO_LRO" install),
the issue goes away.  These are compute nodes that do not do routing or IP
forwarding.  The hang is easily reproduced.  The particulars follow.

Roy Larsen
Intel Corp.
[email protected]
JF5-3-J4

------------------

Red Hat EL5.3 (2.6.18-128.el5 kernel)
Dual socket Nehalem 2.9GHz nodes (8 cores) with 12GB of memory, hyper-threading disabled
Ethernet controller: Intel Corporation 82598EB 10-Gigabit AT CX4 Network Connection (rev 01)
Fujitsu xg700 switch
8 nodes (64 cores)
Intel MPI 4.0.0.014

[r...@cstnh-1 library]# ethtool -i eth2
driver: ixgbe
version: 2.0.44.14-NAPI
firmware-version: 1.8-0
bus-info: 0000:02:00.0

ixgbe driver loaded with the following options:
modprobe ixgbe InterruptThrottleRate=0,0

netstat -t on node "nh1-eth2"

Proto Recv-Q Send-Q Local Address               Foreign Address             State
tcp        0   5224 nh1-eth2:55716              nh2-eth2:44115              ESTABLISHED

netstat -t on node "nh2-eth2"

Proto Recv-Q Send-Q Local Address               Foreign Address             State
tcp        0 331648 nh2-eth2:44115              nh1-eth2:55716              ESTABLISHED
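
For anyone who wants to check for the same condition on their own nodes, here
is a minimal Python sketch that reads the two netstat -t outputs above and
flags a connection whose two ends both have queued send data and empty receive
queues.  The function names (parse_netstat, mutually_stalled) are made up for
this illustration and are not part of any tool.

```python
NH1 = """\
Proto Recv-Q Send-Q Local Address               Foreign Address             State
tcp        0   5224 nh1-eth2:55716              nh2-eth2:44115              ESTABLISHED
"""

NH2 = """\
Proto Recv-Q Send-Q Local Address               Foreign Address             State
tcp        0 331648 nh2-eth2:44115              nh1-eth2:55716              ESTABLISHED
"""


def parse_netstat(text):
    """Parse the tcp rows of `netstat -t` output into dicts.

    Only the first five columns are used, so a State column that has
    wrapped onto its own line does not matter.
    """
    conns = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[0] == "tcp":
            conns.append({
                "recv_q": int(fields[1]),
                "send_q": int(fields[2]),
                "local": fields[3],
                "foreign": fields[4],
            })
    return conns


def mutually_stalled(a, b):
    """True if a and b are the two ends of one connection and both ends
    have a non-empty send queue while both receive queues are empty."""
    same_conn = a["local"] == b["foreign"] and a["foreign"] == b["local"]
    both_sending = a["send_q"] > 0 and b["send_q"] > 0
    nothing_received = a["recv_q"] == 0 and b["recv_q"] == 0
    return same_conn and both_sending and nothing_received


if __name__ == "__main__":
    (end1,) = parse_netstat(NH1)
    (end2,) = parse_netstat(NH2)
    print(mutually_stalled(end1, end2))
```

A single snapshot only shows that data is queued; to call it a hang the same
check should hold across several samples taken some seconds apart.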

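The tcpdump(8) trace shows the connection is not making progress.  Both sides
keep sending 1460-byte segments at growing intervals (roughly 51 s, 103 s, then
120 s), but the ACK numbers never advance (39426 from nh1, 1 from nh2):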

[r...@cstnh-1 library]# tcpdump -i eth2 -v host nh2-eth2 and host nh1-eth2 and port 55716 and port 44115
tcpdump-tnic: listening on eth2, link-type EN10MB (Ethernet), capture size 96 bytes
17:17:17.092120 IP (tos 0x0, ttl 64, id 95, offset 0, flags [DF], proto TCP (6), length 1500)
    nh2-eth2.44115 > nh1-eth2.55716: Flags [.], ack 162998588, win 382, length 1460
17:17:17.092227 IP (tos 0x0, ttl 64, id 56737, offset 0, flags [DF], proto TCP (6), length 40)
    nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
17:17:17.348305 IP (tos 0x0, ttl 64, id 56738, offset 0, flags [DF], proto TCP (6), length 1500)
    nh1-eth2.55716 > nh2-eth2.44115: Flags [.], ack 39426, win 382, length 1460
17:17:17.348326 IP (tos 0x0, ttl 64, id 96, offset 0, flags [DF], proto TCP (6), length 40)
    nh2-eth2.44115 > nh1-eth2.55716: Flags [.], cksum 0x99b6 (correct), ack 1, win 382, length 0
17:17:17.348331 IP (tos 0x0, ttl 64, id 56739, offset 0, flags [DF], proto TCP (6), length 40)
    nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
17:18:08.548706 IP (tos 0x0, ttl 64, id 97, offset 0, flags [DF], proto TCP (6), length 1500)
    nh2-eth2.44115 > nh1-eth2.55716: Flags [.], ack 1, win 382, length 1460
17:18:08.548711 IP (tos 0x0, ttl 64, id 56740, offset 0, flags [DF], proto TCP (6), length 40)
    nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
17:18:09.060306 IP (tos 0x0, ttl 64, id 56741, offset 0, flags [DF], proto TCP (6), length 1500)
    nh1-eth2.55716 > nh2-eth2.44115: Flags [.], ack 39426, win 382, length 1460
17:18:09.060327 IP (tos 0x0, ttl 64, id 98, offset 0, flags [DF], proto TCP (6), length 40)
    nh2-eth2.44115 > nh1-eth2.55716: Flags [.], cksum 0x99b6 (correct), ack 1, win 382, length 0
17:18:09.060332 IP (tos 0x0, ttl 64, id 56742, offset 0, flags [DF], proto TCP (6), length 40)
    nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
17:19:51.461901 IP (tos 0x0, ttl 64, id 99, offset 0, flags [DF], proto TCP (6), length 1500)
    nh2-eth2.44115 > nh1-eth2.55716: Flags [.], ack 1, win 382, length 1460
17:19:51.461909 IP (tos 0x0, ttl 64, id 56743, offset 0, flags [DF], proto TCP (6), length 40)
    nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
17:19:52.484306 IP (tos 0x0, ttl 64, id 56744, offset 0, flags [DF], proto TCP (6), length 1500)
    nh1-eth2.55716 > nh2-eth2.44115: Flags [.], ack 39426, win 382, length 1460
17:19:52.484328 IP (tos 0x0, ttl 64, id 100, offset 0, flags [DF], proto TCP (6), length 40)
    nh2-eth2.44115 > nh1-eth2.55716: Flags [.], cksum 0x99b6 (correct), ack 1, win 382, length 0
17:19:52.484333 IP (tos 0x0, ttl 64, id 56745, offset 0, flags [DF], proto TCP (6), length 40)
    nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
17:21:51.463283 IP (tos 0x0, ttl 64, id 101, offset 0, flags [DF], proto TCP (6), length 1500)
    nh2-eth2.44115 > nh1-eth2.55716: Flags [.], ack 1, win 382, length 1460
17:21:51.463288 IP (tos 0x0, ttl 64, id 56746, offset 0, flags [DF], proto TCP (6), length 40)
    nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
17:21:52.484305 IP (tos 0x0, ttl 64, id 56747, offset 0, flags [DF], proto TCP (6), length 1500)
    nh1-eth2.55716 > nh2-eth2.44115: Flags [.], ack 39426, win 382, length 1460
17:21:52.484327 IP (tos 0x0, ttl 64, id 102, offset 0, flags [DF], proto TCP (6), length 40)
    nh2-eth2.44115 > nh1-eth2.55716: Flags [.], cksum 0x99b6 (correct), ack 1, win 382, length 0
17:21:52.484332 IP (tos 0x0, ttl 64, id 56748, offset 0, flags [DF], proto TCP (6), length 40)
    nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
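
To make the timing in the trace easier to see, here is a minimal Python sketch
that pulls the timestamps of the data-carrying segments (length > 0) out of
tcpdump -v output and prints the gaps between them per direction.  It assumes
each packet occupies exactly two unwrapped lines (header, then detail), so
mail-wrapped output has to be rejoined first.  The function name
data_segment_gaps and the regular expressions are my own illustration, and the
sample holds only the nh2-to-nh1 data segments plus one bare ACK from the trace
above.

```python
import re
from collections import defaultdict
from datetime import datetime

SAMPLE = """\
17:17:17.092120 IP (tos 0x0, ttl 64, id 95, offset 0, flags [DF], proto TCP (6), length 1500)
    nh2-eth2.44115 > nh1-eth2.55716: Flags [.], ack 162998588, win 382, length 1460
17:17:17.092227 IP (tos 0x0, ttl 64, id 56737, offset 0, flags [DF], proto TCP (6), length 40)
    nh1-eth2.55716 > nh2-eth2.44115: Flags [.], cksum 0x99b4 (correct), ack 39426, win 382, length 0
17:18:08.548706 IP (tos 0x0, ttl 64, id 97, offset 0, flags [DF], proto TCP (6), length 1500)
    nh2-eth2.44115 > nh1-eth2.55716: Flags [.], ack 1, win 382, length 1460
17:19:51.461901 IP (tos 0x0, ttl 64, id 99, offset 0, flags [DF], proto TCP (6), length 1500)
    nh2-eth2.44115 > nh1-eth2.55716: Flags [.], ack 1, win 382, length 1460
17:21:51.463283 IP (tos 0x0, ttl 64, id 101, offset 0, flags [DF], proto TCP (6), length 1500)
    nh2-eth2.44115 > nh1-eth2.55716: Flags [.], ack 1, win 382, length 1460
"""

# Header line: starts with the capture timestamp.
HEADER = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6}) IP ")
# Detail line: indented, "src > dst: ... length N" where N is the TCP payload.
DETAIL = re.compile(r"^\s+(\S+) > (\S+): .*length (\d+)\s*$")


def data_segment_gaps(lines):
    """Return {(src, dst): [seconds between consecutive data segments]}."""
    times = defaultdict(list)
    stamp = None
    for line in lines:
        header = HEADER.match(line)
        if header:
            stamp = datetime.strptime(header.group(1), "%H:%M:%S.%f")
            continue
        detail = DETAIL.match(line)
        # Skip pure ACKs (payload length 0); keep only data segments.
        if detail and stamp is not None and int(detail.group(3)) > 0:
            times[(detail.group(1), detail.group(2))].append(stamp)
    return {
        flow: [round((b - a).total_seconds(), 3) for a, b in zip(ts, ts[1:])]
        for flow, ts in times.items()
    }


if __name__ == "__main__":
    for flow, gaps in data_segment_gaps(SAMPLE.splitlines()).items():
        print(flow[0], "->", flow[1], gaps)
```

For the nh2-to-nh1 direction the gaps come out at about 51.5, 102.9 and 120.0
seconds.  That pattern looks like retransmission backoff doubling until it
reaches Linux's 120-second maximum RTO, but since the capture does not print
sequence numbers this is an inference rather than something the trace proves.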
_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel