Re: [E1000-devel] crash in ixgbe driver code during system reboot when operating in bonding[ active-backup ] mode

Hariharan Nagarajan -X (hanagara - HCL TECHNOLOGIES LIMITED at Cisco) Tue, 18 Oct 2011 23:34:15 -0700

Hi Don,

Thanks for your reply.


Unfortunately, we never hit this issue again after the first occurrence,
so we do not have a conclusive way to reproduce it.

We only inferred from the kdb backtraces and the code paths that a race
condition exists.

Running bonding with the active-backup policy and simulating link loss
during box shutdown, for example by unplugging one of the links
(if that is possible), may help. But we did not succeed in reproducing it
that way either.

It seems to be a problem of rare occurrence.

Thanks,
Hari

-----Original Message-----
From: Skidmore, Donald C [mailto:[email protected]] 
Sent: Tuesday, October 18, 2011 9:43 PM
To: Hariharan Nagarajan -X (hanagara - HCL TECHNOLOGIES LIMITED at
Cisco); Shirwaikar, Atita
Subject: RE: [E1000-devel] crash in ixgbe driver code during system
reboot when operating in bonding[ active-backup ] mode

Hey Hari,

We were under the mistaken belief that the upper stack was grabbing the
rtnl lock during shutdown.  This is incorrect, as you pointed out.
After looking at several other drivers I can see many grab the rtnl lock
in their shutdown path.  Strangely enough I also see a fair number that
don't; I imagine they may have the same race condition you found in
ixgbe.

We are looking at adding the lock in our shutdown path and hopefully it
will make it into our next released driver.
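
As a rough illustration of that idea (a userspace model only, not the actual ixgbe change), the interlock could look something like the sketch below. Every name in it (`fake_nic`, `cfg_lock`, `fake_shutdown`, `fake_set_rx_mode`) is hypothetical, and a pthread mutex stands in for the rtnl lock. It assumes the failover path takes the same lock before touching the address filters.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names come from ixgbe. */
struct fake_nic {
    bool present;       /* false once shutdown has detached the device */
    int  filter_writes; /* address-filter updates that reached the "hardware" */
};

/* Stands in for the kernel's rtnl lock (an assumption of this model). */
static pthread_mutex_t cfg_lock = PTHREAD_MUTEX_INITIALIZER;

/* Shutdown path: take the lock, then mark the device as gone before reset. */
void fake_shutdown(struct fake_nic *nic)
{
    pthread_mutex_lock(&cfg_lock);
    nic->present = false; /* later rx-mode updates see this and bail out */
    pthread_mutex_unlock(&cfg_lock);
}

/* Failover path: assumed to take the same lock while updating addresses.
 * Returns true if the filters were written, false if the device is gone. */
bool fake_set_rx_mode(struct fake_nic *nic)
{
    bool done = false;

    pthread_mutex_lock(&cfg_lock);
    if (nic->present) {
        nic->filter_writes++;
        done = true;
    }
    pthread_mutex_unlock(&cfg_lock);
    return done;
}
```

In this model, a call to `fake_set_rx_mode()` that arrives after `fake_shutdown()` returns false without touching the device, instead of spinning on writes that never take effect.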

One thing that has slowed us up a bit is we have yet to be able to
recreate your failure.  We have tried running scripts in a tight loop
changing the multicast and unicast addresses on a port while shutting
the system down but have yet to hit this condition.  Do you have any
pointers on how you recreated the race?  I would really like to be able
to verify internally any fix we come up with.

Thanks,
-Don Skidmore <[email protected]>


>-----Original Message-----
>From: Hariharan Nagarajan -X (hanagara - HCL TECHNOLOGIES LIMITED at
>Cisco) [mailto:[email protected]]
>Sent: Tuesday, October 18, 2011 6:06 AM
>To: Hariharan Nagarajan -X (hanagara - HCL TECHNOLOGIES LIMITED at
>Cisco); Shirwaikar, Atita
>Cc: Skidmore, Donald C
>Subject: RE: [E1000-devel] crash in ixgbe driver code during system
>reboot when operating in bonding[ active-backup ] mode
>
>Hi Atita,
>
>Is my understanding correct, or am I missing something here? I just
>don't see how rtnl_lock could avoid this particular race condition.
>
>Any help would be highly useful.
>
>Thanks,
>Hari
>
>-----Original Message-----
>From: Hariharan Nagarajan -X (hanagara - HCL TECHNOLOGIES LIMITED at
>Cisco)
>Sent: Tuesday, September 27, 2011 1:25 PM
>To: 'Shirwaikar, Atita'; [email protected]
>Cc: Skidmore, Donald C
>Subject: RE: [E1000-devel] crash in ixgbe driver code during system
>reboot when operating in bonding[ active-backup ] mode
>
>Hi Atita,
>
>Thanks for your pointers. I looked at the bonding code and the patch you
>pointed out, which adds an rtnl_lock-synchronized version of
>bond_mii_monitor().
>
>But when I looked at the ixgbe driver code path for shutdown, I did not
>find rtnl_lock being used or held anywhere in this path.
>
>So I am curious to understand how rtnl_lock in the bonding code will
>help to avoid this particular race condition.
>
>Thanks,
>Hari
>
>-----Original Message-----
>From: Shirwaikar, Atita [mailto:[email protected]]
>Sent: Saturday, September 24, 2011 3:04 AM
>To: [email protected]
>Cc: Skidmore, Donald C; Hariharan Nagarajan -X (hanagara - HCL
>TECHNOLOGIES LIMITED at Cisco)
>Subject: RE: [E1000-devel] crash in ixgbe driver code during system
>reboot when operating in bonding[ active-backup ] mode
>
>Hi Hari
>
>The issue that you are seeing seems to have been fixed in the 2.6.24
>kernel.
>In summary, the RTNL lock is now acquired by the bonding driver before it
>does any failover processing such as changing the MAC address. This should
>prevent race conditions like the one you are seeing.
>
>The following patch did the fix
>http://kerneltrap.org/mailarchive/linux-netdev/2007/10/11/334752.
>
>
>Thanks
>Atita
>
>-----Original Message-----
>From: Skidmore, Donald C
>Sent: Friday, September 09, 2011 3:11 PM
>To: Shirwaikar, Atita
>Subject: FW: [E1000-devel] crash in ixgbe driver code during system
>reboot when operating in bonding[ active-backup ] mode
>
>This is the issue I was talking about.  I think the first step would be
>to try and recreate the failure.  He seemed to think it wasn't that
>difficult.  But then I haven't done it so ...   ;)
>
>Thanks,
>-Don
>
>-----Original Message-----
>From: Hariharan Nagarajan -X (hanagara - HCL TECHNOLOGIES LIMITED at
>Cisco) [mailto:[email protected]]
>Sent: Friday, September 09, 2011 1:26 AM
>To: Skidmore, Donald C; [email protected]
>Cc: Siddharth Vajirkar (svajirka)
>Subject: RE: [E1000-devel] crash in ixgbe driver code during system
>reboot when operating in bonding[ active-backup ] mode
>
>Hi Don,
>
>Thanks for your reply. While we have not tried later drivers, we also
>do not see the issue on every reboot because of the timing-dependent
>nature of this race.
>
>But considering what code inspection of these two paths shows, I think
>it would be nice if we could address this in one of the upcoming releases.
>
>Thanks,
>Hari
>
>-----Original Message-----
>From: Skidmore, Donald C [mailto:[email protected]]
>Sent: Friday, September 09, 2011 7:36 AM
>To: Hariharan Nagarajan -X (hanagara - HCL TECHNOLOGIES LIMITED at
>Cisco); [email protected]
>Cc: Siddharth Vajirkar (svajirka)
>Subject: RE: [E1000-devel] crash in ixgbe driver code during system
>reboot when operating in bonding[ active-backup ] mode
>
>Hi Hari,
>
>First, thanks for the detailed analysis of your crash dump.  It does look
>like a possible race condition between shutdown and the bonding device
>trying to update the address list.  We haven't seen a similar failure
>during our validation, but we will try more targeted testing for this
>specific event.
>
>By any chance have you seen this failure with a more recent driver
>(3.4.24)?  Like I mentioned above, we haven't seen this failure before, so
>I don't know of a fix in the later driver, but a fair amount of the code
>affected here was refactored since the release you're running with.
>
>We will look into this failure and work to get a fix into an upcoming
>release as well as into the kernel driver as soon as possible.  However, I
>doubt it will make the next SourceForge release, which I should be
>pushing up to SourceForge tomorrow.
>
>Thanks,
>-Don Skidmore <[email protected]>
>
>
>>-----Original Message-----
>>From: Hariharan Nagarajan -X (hanagara - HCL TECHNOLOGIES LIMITED at
>>Cisco) [mailto:[email protected]]
>>Sent: Thursday, September 08, 2011 5:03 AM
>>To: [email protected]
>>Cc: Siddharth Vajirkar (svajirka)
>>Subject: [E1000-devel] crash in ixgbe driver code during system reboot
>>when operating in bonding[ active-backup ] mode
>>
>>Hello all,
>>
>>
>>
>>We have an application running on a Linux-based OS. In one of our systems,
>>which is based on the 2.6.23 Linux kernel, we have 10G NICs being used
>>in active-backup bond mode to provide interface backup capability in case
>>one of the links fails.
>>
>>
>>
>>eth2 and eth3 are the system interfaces in bonding with active-backup
>>mode.
>>
>>
>>
>>OS: Linux based on 2.6.23 kernel version
>>
>>ixgbe driver version: 3.1.17
>>
>>
>>
>>The system is a multiprocessor system with ~24 CPU cores
>>
>>
>>
>>On one particular occasion while rebooting the system, we encountered a
>>kernel crash and the system entered kernel debugger mode due to a double
>>fault,
>>
>>
>>
>>with dmesg output like this:
>>
>><6>ACPI: PCI interrupt for device 0000:0a:00.0 disabled
>>
>><6>ACPI: PCI interrupt for device 0000:09:00.0 disabled
>>
>><3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 0 not cleared within the polling period
>>
>><3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 1 not cleared within the polling period
>>
>><3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 2 not cleared within the polling period
>>
>><3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 3 not cleared within the polling period
>>
>><3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 4 not cleared within the polling period
>>
>><3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 5 not cleared within the polling period
>>
>><3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 6 not cleared within the polling period
>>
>><3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 7 not cleared within the polling period
>>
>><3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 8 not cleared within the polling period
>>
>><3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 9 not cleared within the polling period
>>
>><3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 10 not cleared within the polling period
>>
>><3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 11 not cleared within the polling period
>>
>><3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 12 not cleared within the polling period
>>
>><3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 13 not cleared within the polling period
>>
>><3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 14 not cleared within the polling period
>>
>><3>ixgbe: eth3: ixgbe_disable_rx_queue: RXDCTL.ENABLE on Rx queue 15 not cleared within the polling period
>>
>><6>bonding: bond5: link status definitely down for interface eth3,
>>disabling it
>>
>><6>bonding: bond5: making interface eth2 the new active one.
>>
>><0>double fault: 0000 [1] SMP
>>
>>[0]kdb> bt
>>
>>Stack traceback for pid 0
>>
>>0xffffffff80945180        0        0  1    0   R  0xffffffff80945480
>>*swapper
>>
>>rsp                rip                Function (args)
>>
>>======================= <doublefault>
>>
>>kdb_bb: address 0xffffffffffffffff not recognised
>>
>>Using old style backtrace, unreliable with no arguments
>>
>>rsp                rip                Function (args)
>>
>>======================= <doublefault>
>>
>>0xffffffff80de8fd8 0xffffffff804ab4c3 ixgbe_clear_vmdq_generic+0x73
>>
>>+++ Cannot resolve next stack
>>
>>[0]kdb> cpu 1
>>
>>
>>
>>====================================
>>
>>
>>
>>While decoding the stacks on all CPUs, we found two CPUs actively
>>executing ixgbe code.
>>
>>
>>
>>Backtraces:
>>
>>  CPU 2:
>>
>>Stack traceback for pid 6328
>>
>>0xffff8101885fe000     6328     5757  1    2   R  0xffff8101885fe300
>>*reboot
>>
>>rsp                rip                Function (args)
>>
>>0xffff8102e1777c08 0xffffffff8071acc6 kdb_interrupt+0x66 (0x3a991,
>>0x3028, 0x7ddd3, 0x9e582c6e, 0x3028, 0xe028)
>>
>>0xffff8102e1777c68 0xffffffff803a81ce __delay+0xe (0x3a991)
>>
>>0xffff8102e1777ca0 0xffffffff803a8211 __const_udelay+0x31 (invalid)
>>
>>0xffff8102e1777cb0 0xffffffff804a95bc ixgbe_disable_pcie_master+0xac
>>(0xffff81062f97d280)
>>
>>0xffff8102e1777ce0 0xffffffff804b39dc ixgbe_reset_hw_82599+0x7c
>>(0xffff81062f97d280)
>>
>>0xffff8102e1777d10 0xffffffff804a8d4f ixgbe_init_hw_generic+0xf
>>(0xffff81062f97d280)
>>
>>0xffff8102e1777d30 0xffffffff804a2f43 ixgbe_reset+0x63
>>(0xffff81062f97c740)
>>
>>0xffff8102e1777d50 0xffffffff804a355b ixgbe_down+0x2cb
>>(0xffff81062f97c740)
>>
>>0xffff8102e1777d90 0xffffffff804a56d8 __ixgbe_shutdown+0xf8
>>(0xffff8103322f8800, 0xffff8102e1777ddf)
>>
>>0xffff8102e1777dd0 0xffffffff804a57c5 ixgbe_shutdown+0x15
>>(0xffff8103322f8800)
>>
>>0xffff8102e1777df0 0xffffffff803b8815 pci_device_shutdown+0x25
>>(invalid)
>>
>>0xffff8102e1777e00 0xffffffff80422d8a device_shutdown+0x7a
>>
>>0xffff8102e1777e20 0xffffffff8024040c kernel_restart_prepare+0x2c
>>(invalid)
>>
>>0xffff8102e1777e30 0xffffffff80240431 kernel_restart+0x11 (0x0)
>>
>>0xffff8102e1777e50 0xffffffff80240726 sys_reboot+0x1d6 (invalid,
>>invalid, invalid, 0x0)
>>
>>0xffff8102e1777f80 0xffffffff80220542 ia32_sysret (invalid, invalid,
>>invalid, invalid)
>>
>>CPU 0
>>
>>(infinite loop, since IXGBE_WRITE_REG() is not taking effect; this seems
>>to be due to the reset done by the other CPU shown above)
>>
>>
>>
>>0xffffffff80de6d08 0xffffffff804aa387 ixgbe_clear_rar_generic+0x77
>>
>>0xffffffff80de6d18 0xffffffff804ab4ca ixgbe_clear_vmdq_generic+0x7a
>>
>>0xffffffff80de6d28 0xffffffff804aa387 ixgbe_clear_rar_generic+0x77
>>
>>0xffffffff80de6d38 0xffffffff804ab4ca ixgbe_clear_vmdq_generic+0x7a
>>
>>0xffffffff80de6d48 0xffffffff804aa387 ixgbe_clear_rar_generic+0x77
>>
>>0xffffffff80de6d58 0xffffffff804a1b30 ixgbe_set_rx_mode+0x1a0
>>
>>0xffffffff80de6da8 0xffffffff80659b8c __dev_set_rx_mode+0x6c
>>
>>0xffffffff80de6dc8 0xffffffff8065ccba dev_mc_add+0x9a
>>
>>0xffffffff80de6e08 0xffffffff804bce33 bond_change_active_slave+0x143
>>
>>0xffffffff80de6e38 0xffffffff804bd1ab bond_select_active_slave+0x8b
>>
>>0xffffffff80de6e58 0xffffffff804bd40c bond_mii_monitor+0x1cc
>>
>>0xffffffff80de6e98 0xffffffff804bd240 bond_mii_monitor
>>
>>0xffffffff80de6eb8 0xffffffff8023b2b8 run_timer_softirq+0xe8
>>
>>0xffffffff80de6f08 0xffffffff802372e5 __do_softirq+0x75
>>
>>0xffffffff80de6f48 0xffffffff8020d3cc call_softirq+0x1c
>>
>>0xffffffff80de6f60 0xffffffff8020f329 do_softirq+0x49
>>
>>0xffffffff80de6f80 0xffffffff802373d5 irq_exit+0x45
>>
>>0xffffffff80de6f90 0xffffffff80219125 smp_apic_timer_interrupt+0x55
>>
>>0xffffffff80de6f98 0xffffffff8020a4b0 mwait_idle
>>
>>0xffffffff80de6fb0 0xffffffff8020ce76 apic_timer_interrupt+0x66
>>
>>======================= <normal>
>>
>>0xffffffff80a0be98 0xffffffff8020ce76 apic_timer_interrupt+0x66
>>
>>0xffffffff80a0bef8 0xffffffff8020a4f3 mwait_idle+0x43
>>
>>0xffffffff80a0bf20 0xffffffff8020a1d2 enter_idle+0x22
>>
>>0xffffffff80a0bf30 0xffffffff8020a448 cpu_idle+0x78
>>
>>0xffffffff80a0bf50 0xffffffff80717adc rest_init+0x5c
>>
>>0xffffffff80a0bf60 0xffffffff80a1894f start_kernel+0x29f
>>
>>This looks like an unfortunate interaction between reboot and the
>>bonding driver.
>>
>>One CPU is running reboot and calls ixgbe down/reset, disabling the
>>ixgbe device.
>>
>>Another CPU runs the bonding monitor and is trying to change the
>>addresses on the bonding devices.  This ends up looping between
>>clear_vmdq/clear_rar because the device (being shut down) is not
>>responding to the writes.
>>
>>It looks like some logic will be needed to interlock between an ixgbe
>>device shutdown and attempts to rewrite the address list.
>>
>>
>>
>>Has this issue been reported earlier? If not, can it be addressed in a
>>future ixgbe release, considering the race condition we have in this
>>execution path during shutdown?
>>
>>
>>
>>Thanks,
>>
>>Hari


_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit
http://communities.intel.com/community/wired
