We have a system with a SuperMicro X8DTH-6F dual socket motherboard with dual Xeon X5680 processors. The X8DTH-6F motherboard has dual GigE NIC powered by the Intel 82576 (kawela) chipset [attached by igb driver]. We've got S10u9 installed:
$ uname -a SunOS r10.atl.efoldering.com 5.10 Generic_144489-06 i86pc i386 i86pc The symptoms are very similar to what was reported here: http://opensolaris.org/jive/thread.jspa?threadID=124411 Since the chipset is completely different in this case (igb instead of e1000g) I'm posting to a new thread. The system is under a constant incoming load between 40 Mbit/sec to 240 Mbit/sec. At random times (usually about once per day, sometimes twice per day, sometimes not at all), the system disappears off the network (it can't ping other systems, and other systems can't ping it) for between 60 seconds to up to 30 minutes, after which it will mysteriously appear on the network again all without intervention. We actually have both igb0 and igb1 interfaces wired with separate IPs, and both IPs are not pingable when the system is not visible on the network. We collected some packet count stats with kstat and we do see ipackets64 and opackets64 increasing (5 minutes between measurements), but IP receive counters are not increasing much (518 packets received, but only an increase of 1 in the IP inDelivers). Also when the system isn't visible on the network, running 'arp -an' freezes -- using dtrace on running arp shows that it freezes during the syscall to open for the file /dev/arp -- when the system decides to appear on the network again all of the pending 'arp -an' processes suddenly continue and print their output like normal. This makes me think there is somehow a deadlock in the ARP layer that is eventually timing out. Any ideas on what might be going on, or what dtrace commands we might try running to diagnose this further? To help diagnose the situation, since it occurs at random times, we wrote a cron job script do-ping-network.sh as follows: ########################################################################## #!/bin/bash LOG="/tmp/do-ping-network.log.txt" echo "========================================================================= " >> "$LOG" date >> "$LOG" ping -s 192.168.3.1 1024 4 >> "$LOG" 2>&1 echo "------------------------------------------------------------------------------------------------------------" >> "$LOG" date >> "$LOG" kstat -m igb 1 2 >> "$LOG" echo "------------------------------------------------------------------------------------------------------------" >> "$LOG" date >> "$LOG" kstat -m ip -n ip 1 2 >> "$LOG" echo "------------------------------------------------------------------------------------------------------------" >> "$LOG" date >> "$LOG" #arp -an >> "$LOG" 2>&1 /root/do-trace-arp.sh >> "$LOG" 2>&1 ########################################################################## This script is run every 5 minutes from cron, so we have a record of when the network stops working. When the network stops working ping reports 100% packet loss. Note that we run kstat to collect some stats after ping. Today we had an outage for 10 minutes. Here is the relevant kstat output from the first time after ping reported 100% packet loss: ================================================================= Tue Sep 13 19:10:13 EDT 2011 module: igb instance: 0 name: igb0 class: net ierrors 2 ipackets64 13442703584 norcvbuf 285 obytes64 5522625154592 opackets64 13191458263 rbytes64 81689304694198 module: igb instance: 0 name: mac class: net ipackets64 13442703584 obytes64 5522625154592 opackets64 13191458263 rbytes64 81689304694198 module: igb instance: 0 name: statistics class: net XOFFs_recvd 34915 XOFFs_xmitd 97389 XONs_recvd 34915 XONs_xmitd 177 recv_missed_packets 224854 module: ip instance: 0 name: ip class: mib2 inDelivers 61165291672 inDiscards 721 inReceives 61163929129 outRequests 32516565932 ================================================================= And here's the kstat output 5 minutes later (the system is still not visible on the network): ================================================================= module: igb instance: 0 name: igb0 class: net ierrors 2 ipackets64 13442704102 norcvbuf 285 obytes64 5522625158872 opackets64 13191458267 rbytes64 81689304728036 module: igb instance: 0 name: mac class: net ipackets64 13442704102 obytes64 5522625158872 opackets64 13191458267 rbytes64 81689304728036 module: igb instance: 0 name: statistics class: net XOFFs_recvd 34915 XOFFs_xmitd 97389 XONs_recvd 34915 XONs_xmitd 177 recv_missed_packets 225200 module: ip instance: 0 name: ip class: mib2 inDelivers 61165291763 inDiscards 721 inReceives 61163929129 outRequests 32516565936 ================================================================= I elided uninteresting counter values -- all error counters are 0 or very close to 0, I can provide full output if there is interest. Also note that I'm only including igb0 stats -- igb1 link is active but there is no traffic flowing to it. igb0 has a fair amount of TCP traffic being received all the time. Some observations: * ipackets64 increased by 518 packets (presumably from existing TCP connections that continued to try to send data). * opackets64 increased by 4 packets (the 4 pings that were sent out by the cron job -- note that the ping packet size is set to 1024, so that explains why the average packet size sent was 1070 [IP+ICMP headers + ping payload]) * recv_missed_packets increased by 346 packets (not sure what recv_missed_packets means for igb driver?) * ip stats inDelivers increased by 91. * ip stats inReceives did not change. * ip stats outRequests increased by 4 (the 4 pings sent by the cron job) Using dtrace we can see that when arp runs while the network is not responsive, the arp process runs fine until is tries the open syscall to open /dev/arp and that's where it gets stuck until the network finally responds again. Also, this may or may not be relevant, but when the network is down we can login on the console and ifconfig -a is responsive and does not block. Also, logging in on the console sometimes seems to almost immediately clear the problem. I wouldn't believe it if I didn't see it with my own eyes, since I can't see how these two events could possibly be connected. However, we have seen cases where we logged in on the console and the network connectivity problems persisted for several more minutes. We have also tried forcing the interface to go down and come back up using ifconfig and that seems to clear the problem immediately as well. Any ideas or suggestions will be appreciated. Thanks, Kevin -- This message posted from opensolaris.org _______________________________________________ networking-discuss mailing list networking-discuss@opensolaris.org