Hi all,
I have 17 new nodes with identical hardware and software. I can reproduce
this problem on every node I have tried (only three so far, but I assume the
other 14 will behave the same way). That doesn't necessarily rule out a
hardware problem, though; they could all share the same hardware fault.
The problem is that the NIC drops off the network under heavy load. I can now
reproduce this quickly and easily by booting a node and running a disk
benchmark over the network against my I/O server:
bonnie++ -u nobody -d /gpfs/fs0/test/
In another terminal, I watch the interrupt counters:
watch --differences "cat /proc/interrupts"
After a while, interrupts start to appear on the IRQ 82 line:
Every 2.0s: cat /proc/interrupts                    Tue Feb 16 18:13:45 2010

         CPU0     CPU1     CPU2     CPU3     CPU4     CPU5     CPU6     CPU7
  0:   836608        0        0        0        0        0        0        0  IO-APIC-edge   timer
  1:        3        0        0        0        0        0        0        0  IO-APIC-edge   i8042
  8:        1        0        0        0        0        0        0        0  IO-APIC-edge   rtc
  9:        0        0        0        0        0        0        0        0  IO-APIC-level  acpi
 12:        4        0        0        0        0        0        0        0  IO-APIC-edge   i8042
 58:     8159     1010     2069        0        0        0        0        0  PCI-MSI        ahci
 74:      637        0        0        0        0        0        0  1532887  PCI-MSI-X      eth0-Q0
 82:     7585        0        0        0        0        0    83121        0  PCI-MSI-X      eth0
169:       60        0        0        0        0        0        0        0  IO-APIC-level  ehci_hcd:usb1
233:       63        0        0        0        0        0        0        0  IO-APIC-level  ehci_hcd:usb2
NMI:      854       87      531       96      100       85      166      107
LOC:   836190   836569   836462   836354   835971   836134   836028   835070
ERR:        0
MIS:        0
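For anyone who wants to log the IRQ 82 counters rather than eyeball them in watch, here is a minimal Python sketch. It assumes the /proc/interrupts layout shown above (a CPU header row, then one row per IRQ); the function names parse_interrupts, irq_delta and sample_irq are only illustrative and not part of any existing tool.

```python
import time


def parse_interrupts(text):
    """Parse /proc/interrupts text into {irq_label: (per_cpu_counts, description)}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for ln in lines[1:]:
        label, _, rest = ln.partition(":")
        fields = rest.split()
        counts = []
        # Rows such as ERR/MIS carry fewer counters than there are CPUs.
        for field in fields[:ncpu]:
            if not field.isdigit():
                break
            counts.append(int(field))
        table[label.strip()] = (counts, " ".join(fields[len(counts):]))
    return table


def irq_delta(before, after, irq):
    """Per-CPU increase in one IRQ's counters between two snapshots."""
    return [b - a for a, b in zip(before[irq][0], after[irq][0])]


def sample_irq(irq, interval=2.0, path="/proc/interrupts"):
    """Take two snapshots `interval` seconds apart and return the per-CPU delta."""
    with open(path) as fh:
        before = parse_interrupts(fh.read())
    time.sleep(interval)
    with open(path) as fh:
        after = parse_interrupts(fh.read())
    return irq_delta(before, after, irq)
```

Calling sample_irq("82") in a loop while bonnie++ runs would give a per-CPU record of when the eth0 "other" vector starts firing.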
After a few more seconds, the network connection drops and the listing below
appears on the console.
Any suggestions? I've already tried a number of kernel parameters and e1000e
module parameters; they seem to have no effect on the outcome.
0000:04:00.0: eth0: Detected Hardware Unit Hang:
TDH <3e>
TDT <1f>
next_to_use <1f>
next_to_clean <30>
buffer_info[next_to_clean]:
time_stamp <10004ec36>
next_to_watch <41>
jiffies <10004f58f>
next_to_watch.status <0>
MAC Status <80783>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <18>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
TDH <3e>
TDT <1f>
next_to_use <1f>
next_to_clean <30>
buffer_info[next_to_clean]:
time_stamp <10004ec36>
next_to_watch <41>
jiffies <10004fd60>
next_to_watch.status <0>
MAC Status <80783>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
TDH <6f>
TDT <85>
next_to_use <85>
next_to_clean <6c>
buffer_info[next_to_clean]:
time_stamp <10005565e>
next_to_watch <70>
jiffies <100055f83>
next_to_watch.status <0>
MAC Status <80783>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
TDH <6f>
TDT <85>
next_to_use <85>
next_to_clean <6c>
buffer_info[next_to_clean]:
time_stamp <10005565e>
next_to_watch <70>
jiffies <100056753>
next_to_watch.status <0>
MAC Status <80783>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
TDH <6f>
TDT <85>
next_to_use <85>
next_to_clean <6c>
buffer_info[next_to_clean]:
time_stamp <10005565e>
next_to_watch <70>
jiffies <100056f23>
next_to_watch.status <0>
MAC Status <80783>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
TDH <2b>
TDT <51>
next_to_use <51>
next_to_clean <21>
buffer_info[next_to_clean]:
time_stamp <100067de9>
next_to_watch <2d>
jiffies <10006896f>
next_to_watch.status <0>
MAC Status <80783>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
TDH <2b>
TDT <51>
next_to_use <51>
next_to_clean <21>
buffer_info[next_to_clean]:
time_stamp <100067de9>
next_to_watch <2d>
jiffies <10006913f>
next_to_watch.status <0>
MAC Status <80783>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
TDH <2b>
TDT <51>
next_to_use <51>
next_to_clean <21>
buffer_info[next_to_clean]:
time_stamp <100067de9>
next_to_watch <2d>
jiffies <10006990f>
next_to_watch.status <0>
MAC Status <80783>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <18>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
TDH <2b>
TDT <51>
next_to_use <51>
next_to_clean <21>
buffer_info[next_to_clean]:
time_stamp <100067de9>
next_to_watch <2d>
jiffies <10006a0df>
next_to_watch.status <0>
MAC Status <80783>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
TDH <89>
TDT <b3>
next_to_use <b3>
next_to_clean <7b>
buffer_info[next_to_clean]:
time_stamp <10006e5d4>
next_to_watch <8c>
jiffies <10006ef67>
next_to_watch.status <0>
MAC Status <80783>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
TDH <89>
TDT <b3>
next_to_use <b3>
next_to_clean <7b>
buffer_info[next_to_clean]:
time_stamp <10006e5d4>
next_to_watch <8c>
jiffies <10006f738>
next_to_watch.status <0>
MAC Status <80783>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
TDH <12>
TDT <6a>
next_to_use <6a>
next_to_clean <a>
buffer_info[next_to_clean]:
time_stamp <1000754bb>
next_to_watch <1c>
jiffies <100075923>
next_to_watch.status <0>
MAC Status <80783>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
TDH <12>
TDT <6a>
next_to_use <6a>
next_to_clean <a>
buffer_info[next_to_clean]:
time_stamp <1000754bb>
next_to_watch <1c>
jiffies <1000760f3>
next_to_watch.status <0>
MAC Status <80783>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <18>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
TDH <12>
TDT <6a>
next_to_use <6a>
next_to_clean <a>
buffer_info[next_to_clean]:
time_stamp <1000754bb>
next_to_watch <1c>
jiffies <1000768c4>
next_to_watch.status <0>
MAC Status <80783>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <18>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
TDH <74>
TDT <cf>
next_to_use <cf>
next_to_clean <6e>
buffer_info[next_to_clean]:
time_stamp <10007ca15>
next_to_watch <80>
jiffies <10007cee4>
next_to_watch.status <0>
MAC Status <80783>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <18>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
TDH <74>
TDT <cf>
next_to_use <cf>
next_to_clean <6e>
buffer_info[next_to_clean]:
time_stamp <10007ca15>
next_to_watch <80>
jiffies <10007d6b4>
next_to_watch.status <0>
MAC Status <80783>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <18>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
TDH <74>
TDT <cf>
next_to_use <cf>
next_to_clean <6e>
buffer_info[next_to_clean]:
time_stamp <10007ca15>
next_to_watch <80>
jiffies <10007de84>
next_to_watch.status <0>
MAC Status <80783>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
TDH <74>
TDT <cf>
next_to_use <cf>
next_to_clean <6e>
buffer_info[next_to_clean]:
time_stamp <10007ca15>
next_to_watch <80>
jiffies <10007e654>
next_to_watch.status <0>
MAC Status <80783>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <18>
irq 82: nobody cared (try booting with the "irqpoll" option)
Call Trace:
<IRQ> [<ffffffff800babaf>] __report_bad_irq+0x30/0x7d
[<ffffffff800bade2>] note_interrupt+0x1e6/0x227
[<ffffffff800ba2de>] __do_IRQ+0xbd/0x103
[<ffffffff8001231e>] __do_softirq+0x89/0x133
[<ffffffff8006c9bf>] do_IRQ+0xe7/0xf5
[<ffffffff8005726a>] mwait_idle+0x0/0x4a
[<ffffffff8005d615>] ret_from_intr+0x0/0xa
<EOI> [<ffffffff800572a0>] mwait_idle+0x36/0x4a
[<ffffffff8004947b>] cpu_idle+0x95/0xb8
[<ffffffff80077474>] start_secondary+0x498/0x4a7
handlers:
[<ffffffff881b5ed0>] (e1000_msix_other+0x0/0x90 [e1000e])
Disabling IRQ #82
0000:04:00.0: eth0: Detected Hardware Unit Hang:
TDH <32>
TDT <d5>
next_to_use <d5>
next_to_clean <22>
buffer_info[next_to_clean]:
time_stamp <100083619>
next_to_watch <33>
jiffies <1000840b0>
next_to_watch.status <0>
MAC Status <80783>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
TDH <32>
TDT <d5>
next_to_use <d5>
next_to_clean <22>
buffer_info[next_to_clean]:
time_stamp <100083619>
next_to_watch <33>
jiffies <100084880>
next_to_watch.status <0>
MAC Status <80783>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
TDH <32>
TDT <d5>
next_to_use <d5>
next_to_clean <22>
buffer_info[next_to_clean]:
time_stamp <100083619>
next_to_watch <33>
jiffies <100085050>
next_to_watch.status <0>
MAC Status <80783>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
TDH <32>
TDT <d5>
next_to_use <d5>
next_to_clean <22>
buffer_info[next_to_clean]:
time_stamp <100083619>
next_to_watch <33>
jiffies <100085820>
next_to_watch.status <0>
MAC Status <80783>
PHY Status <796d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <10>
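The dumps above are easier to compare once they are parsed. The following is a rough Python sketch, assuming the console output has been saved as text in the format shown; parse_hangs and stall_jiffies are hypothetical helper names, not part of the e1000e driver. The difference between jiffies and time_stamp shows how long the descriptor at next_to_clean has been pending, in jiffies; converting that to seconds would need the kernel's HZ value, which is not stated here.

```python
import re

# Matches lines such as "TDH <3e>" or "PHY 1000BASE-T Status <3c00>".
FIELD_RE = re.compile(r"^\s*([A-Za-z0-9_.\[\] -]+?)\s*<([0-9a-fA-F]+)>\s*$")


def parse_hangs(log_text):
    """Return one dict per 'Detected Hardware Unit Hang' report.

    Keys are the field names as printed; values are the hex numbers as ints.
    """
    reports = []
    current = None
    for line in log_text.splitlines():
        if "Detected Hardware Unit Hang" in line:
            current = {}
            reports.append(current)
            continue
        if current is None:
            continue  # text before the first report
        match = FIELD_RE.match(line)
        if match:
            current[match.group(1)] = int(match.group(2), 16)
    return reports


def stall_jiffies(report):
    """Jiffies elapsed since the stuck buffer was queued for transmit."""
    return report["jiffies"] - report["time_stamp"]
```

Grouping the parsed reports by time_stamp would show that several consecutive dumps refer to the same stuck descriptor, with only jiffies advancing between them.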
Regards,
--
Alex Chekholko [email protected]
_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel