Hi,
Apologies if any information is missing; I will try to be as thorough as
possible. We have hit a wall and are looking for guidance on how to continue
troubleshooting. The driver seems to be resetting the adapter, though that is
speculation without a deeper understanding :-)
The only (possibly) relevant thread I could find was
http://comments.gmane.org/gmane.linux.drivers.e1000.devel/9934.
We have several physical machines, each containing both a dual-port 82576
onboard the motherboard and a quad-port 82580-based expansion card. The exact
cards may differ between machines, but the chipsets are identical. They are
configured as:
82576
eth0: External bond
eth1: iSCSI over mpio to SAN
82580
eth2: External bond
eth3: iSCSI over mpio to SAN
eth4: internal network
eth5: management interface
In my opinion the following section is not directly relevant to the problem,
but it might help inform the scenario. The machines are running Xen and host
multiple VMs. Traffic to and from the VMs travels over the bonded pair of
eth0 and eth2, the VMs access their disks via the iSCSI-over-mpio
connections, and access to dom0 is over the management interface.
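For completeness, this is roughly how we confirm which chipset and driver
version sit behind each interface (a sketch only; the interface names match
our layout, and the `ethtool`/`lspci` output formats vary by version):

```shell
# Confirm the driver version and PCI device behind each interface.
# Interface names below match our layout; adjust as needed.
for i in eth0 eth1 eth2 eth3 eth4 eth5; do
    echo "== $i =="
    ethtool -i "$i"                   # driver, version, firmware, bus-info
    bus=$(ethtool -i "$i" | awk '/^bus-info:/ { print $2 }')
    [ -n "$bus" ] && lspci -s "$bus"  # shows 82576 vs 82580
done
```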
The symptoms are these:
- on the majority of machines, at regular but unpredictable intervals, we
see unresponsive network connectivity from the physical machines (and
therefore, obviously, from the VMs they host)
--- on these machines modinfo igb shows:
version: 3.0.6-k2-2
- on one machine in particular, we have reproducible results (detail later)
--- this machine was running 3.0.6
--- it has been upgraded to:
version: 3.4.8
- these periods of no response can be seen on *all* interfaces, making the
igb driver and upwards the only common factors
- ifconfig shows overruns and dropped packets (detail later)
- /var/log/messages shows the following, repeatedly:
Sep 13 12:23:28 HV020 kernel: connection3:0: ping timeout of 10 secs
expired, recv timeout 5, last rx 4312351768, last ping 4312353018, now
4312355518
Sep 13 12:23:28 HV020 kernel: connection3:0: detected conn error (1011)
Sep 13 12:23:29 HV020 iscsid: Kernel reported iSCSI connection 3:0 error
(1011) state (3)
Sep 13 12:23:43 HV020 kernel: session3: session recovery timed out after
15 secs
Sep 13 12:23:56 HV020 iscsid: connection3:0 is operational after recovery
(2 attempts)
Sep 13 12:31:59 HV020 kernel: bonding: bond1: link status definitely down
for interface eth0, disabling it
Sep 13 12:32:02 HV020 kernel: igb: eth0 NIC Link is Up 1000 Mbps Full
Duplex, Flow Control: RX/TX
Sep 13 12:32:02 HV020 kernel: bonding: bond1: link status definitely up for
interface eth0.
and
Sep 13 19:22:35 HV020 kernel: connection2:0: ping timeout of 10 secs
expired, recv timeout 5, last rx 4318638456, last ping 4318639706, now
4318642245
Sep 13 19:22:35 HV020 kernel: connection2:0: detected conn error (1011)
Sep 13 19:22:37 HV020 iscsid: Kernel reported iSCSI connection 2:0 error
(1011) state (3)
Sep 13 19:23:11 HV020 iscsid: connection2:0 is operational after recovery
(1 attempts)
Sep 13 19:27:13 HV020 kernel: NETDEV WATCHDOG: eth2: transmit timed out
Sep 13 19:27:16 HV020 kernel: bonding: bond1: link status definitely down
for interface eth2, disabling it
Sep 13 19:27:19 HV020 kernel: igb: eth4 NIC Link is Up 1000 Mbps Full
Duplex, Flow Control: None
Sep 13 19:27:19 HV020 kernel: igb: eth5 NIC Link is Up 1000 Mbps Full
Duplex, Flow Control: None
Sep 13 19:27:20 HV020 kernel: igb: eth2 NIC Link is Up 1000 Mbps Full
Duplex, Flow Control: RX/TX
Sep 13 19:27:20 HV020 kernel: bonding: bond1: link status definitely up for
interface eth2.
Sep 13 19:27:26 HV020 kernel: igb: eth3 NIC Link is Up 1000 Mbps Full
Duplex, Flow Control: RX/TX
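To quantify how often this happens, the messages above can be tallied per
interface / iSCSI connection with a small script (a sketch; the grep
patterns are taken from the log lines shown here and may need widening to
catch other events):

```shell
#!/bin/sh
# Tally link-down, TX-watchdog and iSCSI ping-timeout events per
# interface / connection from a syslog file.
# Usage: tally_events /var/log/messages
tally_events() {
    grep -E 'link status definitely down for interface|NETDEV WATCHDOG|ping timeout' "$1" |
    awk '{
        for (i = 1; i <= NF; i++) {
            f = $i
            gsub(/[.,:]+$/, "", f)   # strip trailing punctuation
            if (f ~ /^eth[0-9]+$/ || f ~ /^connection[0-9]+:[0-9]+$/)
                count[f]++
        }
    }
    END { for (k in count) print k, count[k] }'
}
```

Running this over our logs is how we noticed that the events cluster on
particular interfaces around the load peaks.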
As mentioned, on one machine we can reproduce this reliably. The machine
hosts just 4 VMs:
- 3 in a cluster, generating up to 30k connections, but only 600Kbps
incoming and 1Mbps outgoing.
- 1 receiving an SFTP transfer at ~1Mbps
The load produced by the cluster of machines follows an inverse bell curve,
peaking every 60 mins. Throughout any given hour, the following symptoms
can be seen, but with higher frequency and longer duration around the peak:
- slow response times from the cluster, taking 30-90s to respond to an HTTP
request (traffic is over the eth0/eth2 bond)
- tcpdumps of the SFTP traffic, taken at multiple points in the network
stack, show it slowing or pausing completely (traffic is over the eth0/eth2
bond)
- timeout errors in /var/log/messages regarding iSCSI traffic (eth1 and
eth3)
- long pauses and occasional disconnects on SSH sessions to dom0 (eth5)
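For the tcpdump observation, one way to spot the pauses is to scan
`tcpdump -tt` output (epoch timestamps in the first field) for large
inter-packet gaps; a minimal sketch, where the 5-second default threshold
is an arbitrary choice:

```shell
#!/bin/sh
# Report gaps larger than a threshold (seconds) between consecutive
# packets in `tcpdump -tt` output, where the first field of each line
# is an epoch timestamp.
# Usage: find_stalls <capture-text-file> [threshold-seconds]
find_stalls() {
    awk -v gap="${2:-5}" '
        prev != "" && $1 - prev > gap {
            printf "stall of %.1fs before line %d\n", $1 - prev, NR
        }
        { prev = $1 }
    ' "$1"
}
```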
The overruns and dropped packets coincide with periods of high load, when
the SFTP traffic will stop completely and SSH to dom0 becomes unresponsive.
It often also coincides with an interface reset.
We can prevent the overruns (as should be expected, I suppose) by
increasing the RX ring size, so that packets are fully buffered while the
device is unresponsive.
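Concretely, this was along the lines of the following (a sketch; the ring
maximums depend on the NIC, and the exact `ethtool -S` counter names vary
by driver version, so the grep pattern is a guess at the relevant ones):

```shell
# Inspect the current vs. maximum RX ring size, then raise it.
ethtool -g eth0          # shows pre-set maximums and current settings
ethtool -G eth0 rx 4096  # bump RX descriptors toward the hardware maximum

# Watch the overrun-related counters to see whether drops stop increasing.
ethtool -S eth0 | grep -Ei 'rx_(fifo|missed|no_buffer)'
```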
I appreciate that further diagnostic work will be needed to pin down the
problem. Any guidance is appreciated.
Regards,
Simon
_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit
http://communities.intel.com/community/wired