Hi all,

I have 17 new nodes, all identical.  Identical hw, identical sw, and I can 
reproduce this problem on any one of the nodes (well, honestly, I only tried 
three, but I assume the other 14 will have the same issue).  That doesn't 
necessarily rule out a hw problem, though: they could all share the same hw 
problem.

The problem is that the NIC drops off the network under heavy load.  I am now 
able to reproduce this quickly and easily by booting the node and running a 
disk benchmark over the network to my I/O server.

So I run 'bonnie++ -u nobody -d /gpfs/fs0/test/'

In another terminal, I can do 'watch --differences "cat /proc/interrupts"'

After a while, some interrupts appear on the interrupt 82 line:


Every 2.0s: cat /proc/interrupts                    Tue Feb 16 18:13:45 2010

           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
  0:     836608          0          0          0          0          0          0          0    IO-APIC-edge  timer
  1:          3          0          0          0          0          0          0          0    IO-APIC-edge  i8042
  8:          1          0          0          0          0          0          0          0    IO-APIC-edge  rtc
  9:          0          0          0          0          0          0          0          0   IO-APIC-level  acpi
 12:          4          0          0          0          0          0          0          0    IO-APIC-edge  i8042
 58:       8159       1010       2069          0          0          0          0          0         PCI-MSI  ahci
 74:        637          0          0          0          0          0          0    1532887       PCI-MSI-X  eth0-Q0
 82:       7585          0          0          0          0          0      83121          0       PCI-MSI-X  eth0
169:         60          0          0          0          0          0          0          0   IO-APIC-level  ehci_hcd:usb1
233:         63          0          0          0          0          0          0          0   IO-APIC-level  ehci_hcd:usb2
NMI:        854         87        531         96        100         85        166        107
LOC:     836190     836569     836462     836354     835971     836134     836028     835070
ERR:          0
MIS:          0
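
For anyone who wants to log the moment the counters on the interrupt 82 line start moving, instead of watching the terminal, here is a rough Python sketch. It only reads /proc/interrupts, which is the same data the 'watch' command above shows. The function names parse_interrupts and watch_irq are made up for this example, and the 2-second interval simply mirrors the default of watch.

```python
import time


def parse_interrupts(text):
    """Parse /proc/interrupts text into {irq_label: [per-CPU counts]}."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        counts = []
        # Rows such as ERR/MIS carry fewer counters than there are CPUs,
        # so stop at the first field that is not a plain number.
        for field in rest.split()[:ncpus]:
            if not field.isdigit():
                break
            counts.append(int(field))
        table[label.strip()] = counts
    return table


def watch_irq(irq, interval=2.0, path="/proc/interrupts"):
    """Print a timestamped per-CPU delta whenever the counters for `irq` change."""
    previous = None
    while True:
        with open(path) as f:
            current = parse_interrupts(f.read()).get(irq)
        if current is None:
            raise SystemExit(f"IRQ {irq} not found in {path}")
        if previous is not None and current != previous:
            delta = [now - before for now, before in zip(current, previous)]
            print(time.strftime("%H:%M:%S"), f"irq {irq} per-CPU delta:", delta)
        previous = current
        time.sleep(interval)
```

Calling watch_irq("82") on an affected node while bonnie++ runs would give a timestamped record that can be lined up against the console messages. The loop runs until interrupted.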


After a few more seconds, the network connection drops; the below listing 
appears on the console.

Any suggestions?  I've already tried a bunch of kernel parameters and e1000e 
module parameters; none of them seems to have any effect on the outcome.


0000:04:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <3e>
  TDT                  <1f>
  next_to_use          <1f>
  next_to_clean        <30>
buffer_info[next_to_clean]:
  time_stamp           <10004ec36>
  next_to_watch        <41>
  jiffies              <10004f58f>
  next_to_watch.status <0>
MAC Status             <80783>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <18>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <3e>
  TDT                  <1f>
  next_to_use          <1f>
  next_to_clean        <30>
buffer_info[next_to_clean]:
  time_stamp           <10004ec36>
  next_to_watch        <41>
  jiffies              <10004fd60>
  next_to_watch.status <0>
MAC Status             <80783>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <6f>
  TDT                  <85>
  next_to_use          <85>
  next_to_clean        <6c>
buffer_info[next_to_clean]:
  time_stamp           <10005565e>
  next_to_watch        <70>
  jiffies              <100055f83>
  next_to_watch.status <0>
MAC Status             <80783>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <6f>
  TDT                  <85>
  next_to_use          <85>
  next_to_clean        <6c>
buffer_info[next_to_clean]:
  time_stamp           <10005565e>
  next_to_watch        <70>
  jiffies              <100056753>
  next_to_watch.status <0>
MAC Status             <80783>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <6f>
  TDT                  <85>
  next_to_use          <85>
  next_to_clean        <6c>
buffer_info[next_to_clean]:
  time_stamp           <10005565e>
  next_to_watch        <70>
  jiffies              <100056f23>
  next_to_watch.status <0>
MAC Status             <80783>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <2b>
  TDT                  <51>
  next_to_use          <51>
  next_to_clean        <21>
buffer_info[next_to_clean]:
  time_stamp           <100067de9>
  next_to_watch        <2d>
  jiffies              <10006896f>
  next_to_watch.status <0>
MAC Status             <80783>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <2b>
  TDT                  <51>
  next_to_use          <51>
  next_to_clean        <21>
buffer_info[next_to_clean]:
  time_stamp           <100067de9>
  next_to_watch        <2d>
  jiffies              <10006913f>
  next_to_watch.status <0>
MAC Status             <80783>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <2b>
  TDT                  <51>
  next_to_use          <51>
  next_to_clean        <21>
buffer_info[next_to_clean]:
  time_stamp           <100067de9>
  next_to_watch        <2d>
  jiffies              <10006990f>
  next_to_watch.status <0>
MAC Status             <80783>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <18>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <2b>
  TDT                  <51>
  next_to_use          <51>
  next_to_clean        <21>
buffer_info[next_to_clean]:
  time_stamp           <100067de9>
  next_to_watch        <2d>
  jiffies              <10006a0df>
  next_to_watch.status <0>
MAC Status             <80783>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <89>
  TDT                  <b3>
  next_to_use          <b3>
  next_to_clean        <7b>
buffer_info[next_to_clean]:
  time_stamp           <10006e5d4>
  next_to_watch        <8c>
  jiffies              <10006ef67>
  next_to_watch.status <0>
MAC Status             <80783>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <89>
  TDT                  <b3>
  next_to_use          <b3>
  next_to_clean        <7b>
buffer_info[next_to_clean]:
  time_stamp           <10006e5d4>
  next_to_watch        <8c>
  jiffies              <10006f738>
  next_to_watch.status <0>
MAC Status             <80783>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <12>
  TDT                  <6a>
  next_to_use          <6a>
  next_to_clean        <a>
buffer_info[next_to_clean]:
  time_stamp           <1000754bb>
  next_to_watch        <1c>
  jiffies              <100075923>
  next_to_watch.status <0>
MAC Status             <80783>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <12>
  TDT                  <6a>
  next_to_use          <6a>
  next_to_clean        <a>
buffer_info[next_to_clean]:
  time_stamp           <1000754bb>
  next_to_watch        <1c>
  jiffies              <1000760f3>
  next_to_watch.status <0>
MAC Status             <80783>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <18>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <12>
  TDT                  <6a>
  next_to_use          <6a>
  next_to_clean        <a>
buffer_info[next_to_clean]:
  time_stamp           <1000754bb>
  next_to_watch        <1c>
  jiffies              <1000768c4>
  next_to_watch.status <0>
MAC Status             <80783>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <18>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <74>
  TDT                  <cf>
  next_to_use          <cf>
  next_to_clean        <6e>
buffer_info[next_to_clean]:
  time_stamp           <10007ca15>
  next_to_watch        <80>
  jiffies              <10007cee4>
  next_to_watch.status <0>
MAC Status             <80783>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <18>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <74>
  TDT                  <cf>
  next_to_use          <cf>
  next_to_clean        <6e>
buffer_info[next_to_clean]:
  time_stamp           <10007ca15>
  next_to_watch        <80>
  jiffies              <10007d6b4>
  next_to_watch.status <0>
MAC Status             <80783>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <18>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <74>
  TDT                  <cf>
  next_to_use          <cf>
  next_to_clean        <6e>
buffer_info[next_to_clean]:
  time_stamp           <10007ca15>
  next_to_watch        <80>
  jiffies              <10007de84>
  next_to_watch.status <0>
MAC Status             <80783>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <74>
  TDT                  <cf>
  next_to_use          <cf>
  next_to_clean        <6e>
buffer_info[next_to_clean]:
  time_stamp           <10007ca15>
  next_to_watch        <80>
  jiffies              <10007e654>
  next_to_watch.status <0>
MAC Status             <80783>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <18>
irq 82: nobody cared (try booting with the "irqpoll" option)
Call Trace:
 <IRQ>  [<ffffffff800babaf>] __report_bad_irq+0x30/0x7d
 [<ffffffff800bade2>] note_interrupt+0x1e6/0x227
 [<ffffffff800ba2de>] __do_IRQ+0xbd/0x103
 [<ffffffff8001231e>] __do_softirq+0x89/0x133
 [<ffffffff8006c9bf>] do_IRQ+0xe7/0xf5
 [<ffffffff8005726a>] mwait_idle+0x0/0x4a
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff800572a0>] mwait_idle+0x36/0x4a
 [<ffffffff8004947b>] cpu_idle+0x95/0xb8
 [<ffffffff80077474>] start_secondary+0x498/0x4a7
handlers:
[<ffffffff881b5ed0>] (e1000_msix_other+0x0/0x90 [e1000e])
Disabling IRQ #82
0000:04:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <32>
  TDT                  <d5>
  next_to_use          <d5>
  next_to_clean        <22>
buffer_info[next_to_clean]:
  time_stamp           <100083619>
  next_to_watch        <33>
  jiffies              <1000840b0>
  next_to_watch.status <0>
MAC Status             <80783>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <32>
  TDT                  <d5>
  next_to_use          <d5>
  next_to_clean        <22>
buffer_info[next_to_clean]:
  time_stamp           <100083619>
  next_to_watch        <33>
  jiffies              <100084880>
  next_to_watch.status <0>
MAC Status             <80783>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <32>
  TDT                  <d5>
  next_to_use          <d5>
  next_to_clean        <22>
buffer_info[next_to_clean]:
  time_stamp           <100083619>
  next_to_watch        <33>
  jiffies              <100085050>
  next_to_watch.status <0>
MAC Status             <80783>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>
0000:04:00.0: eth0: Detected Hardware Unit Hang:
  TDH                  <32>
  TDT                  <d5>
  next_to_use          <d5>
  next_to_clean        <22>
buffer_info[next_to_clean]:
  time_stamp           <100083619>
  next_to_watch        <33>
  jiffies              <100085820>
  next_to_watch.status <0>
MAC Status             <80783>
PHY Status             <796d>
PHY 1000BASE-T Status  <3c00>
PHY Extended Status    <3000>
PCI Status             <10>

Regards,
-- 
Alex Chekholko   [email protected] 

_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel