[E1000-devel] crash in ixgbe driver code during system reboot when operating in bonding[ active-backup ] mode

Hariharan Nagarajan -X (hanagara - HCL TECHNOLOGIES LIMITED at Cisco) Thu, 08 Sep 2011 05:02:41 -0700

Hello all,


We have an application running on a Linux-based OS. On one of our
systems, which is based on the 2.6.23 Linux kernel, we use 10G NICs in
active-backup bonding mode to provide interface failover in case one of
the links fails.

 

eth2 and eth3 are the system interfaces bonded in active-backup mode.

 

OS: Linux based on 2.6.23 kernel version

ixgbe driver version: 3.1.17

 

The system is a multiprocessor system with ~24 CPU cores.

 

On one occasion, while rebooting the system, we encountered a kernel
crash, and the system entered the kernel debugger (kdb) because of a
double fault, with dmesg output like this:

<6>ACPI: PCI interrupt for device 0000:0a:00.0 disabled

<6>ACPI: PCI interrupt for device 0000:09:00.0 disabled

<3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 0 not
cleared within the polling period

<3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 1 not
cleared within the polling period

<3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 2 not
cleared within the polling period

<3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 3 not
cleared within the polling period

<3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 4 not
cleared within the polling period

<3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 5 not
cleared within the polling period

<3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 6 not
cleared within the polling period

<3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 7 not
cleared within the polling period

<3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 8 not
cleared within the polling period

<3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 9 not
cleared within the polling period

<3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 10 not
cleared within the polling period

<3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 11 not
cleared within the polling period

<3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 12 not
cleared within the polling period

<3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 13 not
cleared within the polling period

<3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 14 not
cleared within the polling period

<3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 15 not
cleared within the polling period

<6>bonding: bond5: link status definitely down for interface eth3,
disabling it

<6>bonding: bond5: making interface eth2 the new active one.

<0>double fault: 0000 [1] SMP 

[0]kdb> bt

Stack traceback for pid 0

0xffffffff80945180        0        0  1    0   R  0xffffffff80945480
*swapper

rsp                rip                Function (args)

======================= <doublefault>

kdb_bb: address 0xffffffffffffffff not recognised

Using old style backtrace, unreliable with no arguments

rsp                rip                Function (args)

======================= <doublefault>

0xffffffff80de8fd8 0xffffffff804ab4c3 ixgbe_clear_vmdq_generic+0x73

+++ Cannot resolve next stack

[0]kdb> cpu 1

 

====================================

 

While decoding the stacks on all CPUs, we found two CPUs actively
performing ixgbe operations.

 

Backtraces:

  CPU 2:

Stack traceback for pid 6328

0xffff8101885fe000     6328     5757  1    2   R  0xffff8101885fe300
*reboot

rsp                rip                Function (args)

0xffff8102e1777c08 0xffffffff8071acc6 kdb_interrupt+0x66 (0x3a991,
0x3028, 0x7ddd3, 0x9e582c6e, 0x3028, 0xe028)

0xffff8102e1777c68 0xffffffff803a81ce __delay+0xe (0x3a991)

0xffff8102e1777ca0 0xffffffff803a8211 __const_udelay+0x31 (invalid)

0xffff8102e1777cb0 0xffffffff804a95bc ixgbe_disable_pcie_master+0xac
(0xffff81062f97d280)

0xffff8102e1777ce0 0xffffffff804b39dc ixgbe_reset_hw_82599+0x7c
(0xffff81062f97d280)

0xffff8102e1777d10 0xffffffff804a8d4f ixgbe_init_hw_generic+0xf
(0xffff81062f97d280)

0xffff8102e1777d30 0xffffffff804a2f43 ixgbe_reset+0x63
(0xffff81062f97c740)

0xffff8102e1777d50 0xffffffff804a355b ixgbe_down+0x2cb
(0xffff81062f97c740)

0xffff8102e1777d90 0xffffffff804a56d8 __ixgbe_shutdown+0xf8
(0xffff8103322f8800, 0xffff8102e1777ddf)

0xffff8102e1777dd0 0xffffffff804a57c5 ixgbe_shutdown+0x15
(0xffff8103322f8800)

0xffff8102e1777df0 0xffffffff803b8815 pci_device_shutdown+0x25 (invalid)

0xffff8102e1777e00 0xffffffff80422d8a device_shutdown+0x7a

0xffff8102e1777e20 0xffffffff8024040c kernel_restart_prepare+0x2c
(invalid)

0xffff8102e1777e30 0xffffffff80240431 kernel_restart+0x11 (0x0)

0xffff8102e1777e50 0xffffffff80240726 sys_reboot+0x1d6 (invalid,
invalid, invalid, 0x0)

0xffff8102e1777f80 0xffffffff80220542 ia32_sysret (invalid, invalid,
invalid, invalid)

CPU 0

(Infinite loop: IXGBE_WRITE_REG() is not taking effect, apparently
because of the reset being done by the other CPU shown above.)

 

0xffffffff80de6d08 0xffffffff804aa387 ixgbe_clear_rar_generic+0x77

0xffffffff80de6d18 0xffffffff804ab4ca ixgbe_clear_vmdq_generic+0x7a

0xffffffff80de6d28 0xffffffff804aa387 ixgbe_clear_rar_generic+0x77

0xffffffff80de6d38 0xffffffff804ab4ca ixgbe_clear_vmdq_generic+0x7a

0xffffffff80de6d48 0xffffffff804aa387 ixgbe_clear_rar_generic+0x77

0xffffffff80de6d58 0xffffffff804a1b30 ixgbe_set_rx_mode+0x1a0

0xffffffff80de6da8 0xffffffff80659b8c __dev_set_rx_mode+0x6c

0xffffffff80de6dc8 0xffffffff8065ccba dev_mc_add+0x9a

0xffffffff80de6e08 0xffffffff804bce33 bond_change_active_slave+0x143

0xffffffff80de6e38 0xffffffff804bd1ab bond_select_active_slave+0x8b

0xffffffff80de6e58 0xffffffff804bd40c bond_mii_monitor+0x1cc

0xffffffff80de6e98 0xffffffff804bd240 bond_mii_monitor

0xffffffff80de6eb8 0xffffffff8023b2b8 run_timer_softirq+0xe8

0xffffffff80de6f08 0xffffffff802372e5 __do_softirq+0x75

0xffffffff80de6f48 0xffffffff8020d3cc call_softirq+0x1c

0xffffffff80de6f60 0xffffffff8020f329 do_softirq+0x49

0xffffffff80de6f80 0xffffffff802373d5 irq_exit+0x45

0xffffffff80de6f90 0xffffffff80219125 smp_apic_timer_interrupt+0x55

0xffffffff80de6f98 0xffffffff8020a4b0 mwait_idle

0xffffffff80de6fb0 0xffffffff8020ce76 apic_timer_interrupt+0x66

======================= <normal>

0xffffffff80a0be98 0xffffffff8020ce76 apic_timer_interrupt+0x66

0xffffffff80a0bef8 0xffffffff8020a4f3 mwait_idle+0x43

0xffffffff80a0bf20 0xffffffff8020a1d2 enter_idle+0x22

0xffffffff80a0bf30 0xffffffff8020a448 cpu_idle+0x78

0xffffffff80a0bf50 0xffffffff80717adc rest_init+0x5c

0xffffffff80a0bf60 0xffffffff80a1894f start_kernel+0x29f

This looks like an unfortunate interaction between reboot and the
bonding driver.

One CPU is running reboot and calls ixgbe down/reset, disabling the
ixgbe device.

Another CPU runs the bonding monitor and is trying to change the
addresses on the bonding devices. This ends up looping between
clear_vmdq/clear_rar because the device (being shut down) is not
responding to the writes.
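
To illustrate the loop described above, here is a minimal user-space
sketch in C. It is not the driver's code: it assumes that the two
routines call each other until a register read-back shows the entry
cleared, which is what the alternating frames in the CPU 0 trace
suggest. The names fake_hw, reg_read, reg_write, model_clear_vmdq,
model_clear_rar, model_max_depth and DEPTH_LIMIT are all hypothetical,
as is the assumption that reads from the reset device return all ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the NIC: one pool bitmap for one address slot. */
struct fake_hw {
    uint32_t mpsar;   /* modelled pool-select register */
    bool responsive;  /* false once the other CPU has reset the device */
    int depth;        /* current nesting depth of model_clear_vmdq() */
    int max_depth;    /* deepest nesting observed */
};

/* Guard so the demo terminates; a kernel stack has no such guard. */
#define DEPTH_LIMIT 1000

static uint32_t reg_read(const struct fake_hw *hw)
{
    /* Assumption: a device that is not responding reads back as all ones. */
    return hw->responsive ? hw->mpsar : 0xFFFFFFFFu;
}

static void reg_write(struct fake_hw *hw, uint32_t value)
{
    /* Writes to the unresponsive device are silently dropped. */
    if (hw->responsive)
        hw->mpsar = value;
}

static void model_clear_rar(struct fake_hw *hw);

static void model_clear_vmdq(struct fake_hw *hw)
{
    if (++hw->depth > hw->max_depth)
        hw->max_depth = hw->depth;

    /* If pools still appear to be set, clear them and release the slot. */
    if (hw->depth < DEPTH_LIMIT && reg_read(hw) != 0) {
        reg_write(hw, 0);
        model_clear_rar(hw);
    }
    hw->depth--;
}

static void model_clear_rar(struct fake_hw *hw)
{
    /* Clearing an address slot also clears its pool associations. */
    model_clear_vmdq(hw);
}

/* Run the model once and report how deep the mutual recursion went. */
static int model_max_depth(bool responsive)
{
    struct fake_hw hw = { .mpsar = 0x1, .responsive = responsive,
                          .depth = 0, .max_depth = 0 };
    model_clear_rar(&hw);
    return hw.max_depth;
}
```

With a responsive device, model_max_depth(true) returns 2: the second
read-back sees the cleared register and the calls unwind. With an
unresponsive device, model_max_depth(false) only stops at DEPTH_LIMIT.
Without that guard the recursion would continue until the stack is
exhausted, which would be consistent with the double fault.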

It looks like some logic will be needed to interlock between an ixgbe
device shutdown and an attempt to rewrite the address list.
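
One possible shape for such an interlock, as a user-space sketch in C
only: the shutdown path publishes a "down" flag before it resets the
hardware, and the address-programming path checks that flag before it
touches any registers. The names fake_adapter, model_adapter_init,
model_shutdown and model_set_rx_mode are hypothetical and are not taken
from the ixgbe sources.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical adapter state shared by the shutdown and rx-mode paths. */
struct fake_adapter {
    atomic_bool down;  /* set by the shutdown path before the reset */
    int rar_writes;    /* counts address-filter programming attempts */
};

static void model_adapter_init(struct fake_adapter *a)
{
    atomic_init(&a->down, false);
    a->rar_writes = 0;
}

/* Shutdown path: mark the device down first, then reset the hardware. */
static void model_shutdown(struct fake_adapter *a)
{
    atomic_store(&a->down, true);
    /* ...the hardware reset would follow here... */
}

/* Address-programming path: refuse to touch a device that is going down. */
static bool model_set_rx_mode(struct fake_adapter *a)
{
    if (atomic_load(&a->down))
        return false;   /* skip register access entirely */
    a->rar_writes++;    /* stands in for rewriting the address list */
    return true;
}
```

A flag check on its own still leaves a window between the check and the
register writes, so a real fix would probably also need a lock or some
other way for the shutdown path to wait for an in-flight address update
to finish.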

 

Has this issue been reported before? If not, can it be addressed in a
future ixgbe release, given the race condition in this execution path
during shutdown?

 

Thanks,

Hari

_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit 
http://communities.intel.com/community/wired
