We have a system with a SuperMicro X8DTH-6F dual socket motherboard with dual 
Xeon X5680 processors. The X8DTH-6F motherboard has dual GigE NIC powered by 
the Intel 82576 (kawela) chipset [attached by igb driver]. We've got S10u9 
installed:

$ uname -a
SunOS r10.atl.efoldering.com 5.10 Generic_144489-06 i86pc i386 i86pc

The symptoms are very similar to what was reported here:
http://opensolaris.org/jive/thread.jspa?threadID=124411

Since the chipset is completely different in this case (igb instead of e1000g) 
I'm posting to a new thread.

The system is under a constant incoming load between 40 Mbit/sec to 240 
Mbit/sec. At random times (usually about once per day, sometimes twice per day, 
sometimes not at all), the system disappears off the network (it can't ping 
other systems, and other systems can't ping it) for between 60 seconds to up to 
30 minutes, after which it will mysteriously appear on the network again all 
without intervention.

We actually have both igb0 and igb1 interfaces wired with separate IPs, and 
both IPs are not pingable when the system is not visible on the network. We 
collected some packet count stats with kstat and we do see ipackets64 and 
opackets64 increasing (5 minutes between measurements), but IP receive counters 
are not increasing much (518 packets received, but only an increase of 1 in the 
IP inDelivers). Also when the system isn't visible on the network, running 'arp 
-an' freezes -- using dtrace on running arp shows that it freezes during the 
syscall to open for the file /dev/arp -- when the system decides to appear on 
the network again all of the pending 'arp -an' processes suddenly continue and 
print their output like normal. This makes me think there is somehow a deadlock 
in the ARP layer that is eventually timing out.

Any ideas on what might be going on, or what dtrace commands we might try 
running to diagnose this further?

To help diagnose the situation, since it occurs at random times, we wrote a 
cron job script do-ping-network.sh as follows:

##########################################################################
#!/bin/bash

LOG="/tmp/do-ping-network.log.txt"

echo "========================================================================= 
" >> "$LOG"
date >> "$LOG"
ping -s 192.168.3.1 1024 4 >> "$LOG" 2>&1
echo 
"------------------------------------------------------------------------------------------------------------"
 >> "$LOG"
date >> "$LOG"
kstat -m igb 1 2 >> "$LOG"
echo 
"------------------------------------------------------------------------------------------------------------"
 >> "$LOG"
date >> "$LOG"
kstat -m ip -n ip 1 2 >> "$LOG"
echo 
"------------------------------------------------------------------------------------------------------------"
 >> "$LOG"
date >> "$LOG"
#arp -an >> "$LOG" 2>&1
/root/do-trace-arp.sh >> "$LOG" 2>&1
##########################################################################

This script is run every 5 minutes from cron, so we have a record of when the 
network stops working. When the network stops working ping reports 100% packet 
loss. Note that we run kstat to collect some stats after ping. Today we had an 
outage for 10 minutes. Here is the relevant kstat output from the first time 
after ping reported 100% packet loss:

=================================================================
Tue Sep 13 19:10:13 EDT 2011

module: igb                             instance: 0     
name:   igb0                            class:    net
        ierrors                         2
        ipackets64                      13442703584
        norcvbuf                        285
        obytes64                        5522625154592
        opackets64                      13191458263
        rbytes64                        81689304694198

module: igb                             instance: 0     
name:   mac                             class:    net
        ipackets64                      13442703584
        obytes64                        5522625154592
        opackets64                      13191458263
        rbytes64                        81689304694198

module: igb                             instance: 0     
name:   statistics                      class:    net
        XOFFs_recvd                     34915
        XOFFs_xmitd                     97389
        XONs_recvd                      34915
        XONs_xmitd                      177
        recv_missed_packets             224854


module: ip                              instance: 0     
name:   ip                              class:    mib2
        inDelivers                      61165291672
        inDiscards                      721
        inReceives                      61163929129
        outRequests                     32516565932
=================================================================

And here's the kstat output 5 minutes later (the system is still not visible on 
the network):

=================================================================
module: igb                             instance: 0     
name:   igb0                            class:    net
        ierrors                         2
        ipackets64                      13442704102
        norcvbuf                        285
        obytes64                        5522625158872
        opackets64                      13191458267
        rbytes64                        81689304728036

module: igb                             instance: 0     
name:   mac                             class:    net
        ipackets64                      13442704102
        obytes64                        5522625158872
        opackets64                      13191458267
        rbytes64                        81689304728036

module: igb                             instance: 0     
name:   statistics                      class:    net
        XOFFs_recvd                     34915
        XOFFs_xmitd                     97389
        XONs_recvd                      34915
        XONs_xmitd                      177
        recv_missed_packets             225200


module: ip                              instance: 0     
name:   ip                              class:    mib2
        inDelivers                      61165291763
        inDiscards                      721
        inReceives                      61163929129
        outRequests                     32516565936
=================================================================

I elided uninteresting counter values -- all error counters are 0 or very close 
to 0, I can provide full output if there is interest. Also note that I'm only 
including igb0 stats -- igb1 link is active but there is no traffic flowing to 
it. igb0 has a fair amount of TCP traffic being received all the time.

Some observations:
 * ipackets64 increased by 518 packets (presumably from existing TCP 
connections that continued to try to send data).
 * opackets64 increased by 4 packets (the 4 pings that were sent out by the 
cron job -- note that the ping packet size is set to 1024, so that explains why 
the average packet size sent was 1070 [IP+ICMP headers + ping payload])
 * recv_missed_packets increased by 346 packets (not sure what 
recv_missed_packets means for igb driver?)
 * ip stats inDelivers increased by 91.
 * ip stats inReceives did not change.
 * ip stats outRequests increased by 4 (the 4 pings sent by the cron job)

Using dtrace we can see that when arp runs while the network is not responsive, 
the arp process runs fine until is tries the open syscall to open /dev/arp and 
that's where it gets stuck until the network finally responds again.

Also, this may or may not be relevant, but when the network is down we can 
login on the console and ifconfig -a is responsive and does not block. Also, 
logging in on the console sometimes seems to almost immediately clear the 
problem. I wouldn't believe it if I didn't see it with my own eyes, since I 
can't see how these two events could possibly be connected. However, we have 
seen cases where we logged in on the console and the network connectivity 
problems persisted for several more minutes. We have also tried forcing the 
interface to go down and come back up using ifconfig and that seems to clear 
the problem immediately as well.

Any ideas or suggestions will be appreciated.

Thanks,
Kevin
-- 
This message posted from opensolaris.org
_______________________________________________
networking-discuss mailing list
networking-discuss@opensolaris.org

Reply via email to