On 30 July 2014 17:46, Ben Greear <[email protected]> wrote:
> Not sure how relevant this is to upstream, but just in case someone
> wants to look at it:
>
> Kernel is modified 3.14.14+, with a good bit of backported ath10k and some
> patches of my own to help stabilize ath10k with my workload and to support
> CT firmware features.
>
> http://dmz2.candelatech.com/git/gitweb.cgi?p=linux-3.14.dev.y/.git;a=summary
>
> Firmware is CT firmware, and it has a bug in this test case where it crashes
> fairly often upon removal of a vdev after some traffic tests have been
> running. Likely this firmware bug is something that I have added or
> at least exacerbated, and I am working on fixing it.
>
> But, when it crashes, it takes the kernel down shortly afterwards
> in a reliable manner:
[...]
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
> IP: [<ffffffffa06a318d>] ath10k_txrx_tx_unref+0x91/0x3c7 [ath10k_core]
[...]
> Call Trace:
> [<ffffffffa06a28b4>] ath10k_htt_tx_detach+0x70/0xd1 [ath10k_core]
> [<ffffffffa06a04cf>] ath10k_htt_detach+0x16/0x1b [ath10k_core]
> [<ffffffffa069eab3>] ath10k_core_stop+0x4f/0x70 [ath10k_core]
> [<ffffffffa069ae32>] ath10k_halt+0xde/0x161 [ath10k_core]
> [<ffffffffa069aeed>] ath10k_stop+0x38/0x89 [ath10k_core]
> [<ffffffffa05b0ae6>] ieee80211_stop_device+0x58/0x84 [mac80211]
> [<ffffffffa069541c>] ? spin_lock_bh+0x9/0xb [ath10k_core]
> [<ffffffffa059d0d3>] ieee80211_do_stop+0x625/0x67d [mac80211]
> [<ffffffff810fdf6a>] ? trace_hardirqs_on+0xd/0xf
> [<ffffffff810c6d42>] ? __local_bh_enable_ip+0xaf/0xd9
> [<ffffffff815d8156>] ? _raw_spin_unlock_bh+0x31/0x35
> [<ffffffff8153a693>] ? dev_deactivate_many+0x129/0x172
> [<ffffffffa059d140>] ieee80211_stop+0x15/0x19 [mac80211]
[...]
> (gdb) l *(ath10k_txrx_tx_unref+0x91)
> 0xe18d is in ath10k_txrx_tx_unref
> (/mnt/sda/home/greearb/git/linux-3.14.dev.y/drivers/net/wireless/ath/ath10k/txrx.c:109).
> 104 }
> 105
> 106 msdu = htt->pending_tx[tx_done->msdu_id];
> 107 skb_cb = ATH10K_SKB_CB(msdu);
> 108
> 109 dma_unmap_single(dev, skb_cb->paddr, msdu->len,
> DMA_TO_DEVICE);

Okay.. So `msdu` is NULL.

I can't seem to find unpaired used_msdu_ids and pending_tx accesses.
This suggests htt->pending_tx itself is invalid (as well as
used_msdu_ids) - perhaps a use-after-free (neither pointer is NULLed).
This in turn suggests ath10k_htt_tx_detach() had already been called
once and this is the second call. The stack trace suggests the
(allegedly second) call originates from drv_stop().

When ath10k crashes, the ath10k_core_start() worker calls ath10k_halt()
directly, sets the RESTARTING state and queues a mac80211 hw restart.

ath10k_stop() calls ath10k_halt() only if the state is ON, RESTARTED or
WEDGED. RESTARTING isn't one of them, but since you have more than one
entry point for hw recovery (pci indication, wmi_send, flush) you can
trigger the ath10k_core_start() worker while the state is already
RESTARTING (i.e. a crash within a crash, before ath10k_start() is
called), which changes the state to WEDGED. WEDGED allows ath10k_halt()
to be called in ath10k_stop(), so the teardown runs a second time. QED.

The following (it has been in upstream for some time now) should fix
the problem:

commit c5058f5b82f226b236dc5a65015152ed3c23efff
Author: Michal Kazior <[email protected]>
Date:   Mon May 26 12:46:03 2014 +0300

    ath10k: perform hw restart lazily

    This reduces risk of races and prepares for more
    hw restart fixes.

    It also makes sense to perform teardown after
    mac80211 starts its restart routine as it
    guarantees it has stopped itself by then
    (including tx queues).

    Signed-off-by: Michal Kazior <[email protected]>
    Signed-off-by: Kalle Valo <[email protected]>

This probably makes your ieee80211_stop_queues() in ath10k_halt()
obsolete too.


Michał
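
For anyone who wants to poke at the state sequence described above, here is a
rough userspace sketch of it. This is a toy model, not driver code: every
model_* name is made up for illustration, and the "lazy" branch is only my
simplified reading of the idea (the recovery worker records the crash and
leaves teardown to a later step), not a copy of what the commit does.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model only. The state names follow the thread; everything
 * prefixed model_ is hypothetical and is not ath10k code.
 */
enum model_state {
	MODEL_OFF,
	MODEL_ON,
	MODEL_RESTARTING,
	MODEL_RESTARTED,
	MODEL_WEDGED,
};

struct model_dev {
	enum model_state state;
	int halt_calls;	/* teardowns performed since the device was started */
};

/* Stands in for the halt/teardown path; a second call is the bug. */
static void model_halt(struct model_dev *d)
{
	d->halt_calls++;
}

/*
 * Recovery worker. The eager variant tears down right away, as described
 * in the thread. The lazy variant (an assumption of this sketch) only
 * records the crash and leaves the teardown for later.
 */
static void model_restart_work(struct model_dev *d, bool lazy)
{
	switch (d->state) {
	case MODEL_ON:
		if (!lazy)
			model_halt(d);
		d->state = MODEL_RESTARTING;
		break;
	case MODEL_RESTARTING:
		/* Crash within a crash, before start() has run again. */
		d->state = MODEL_WEDGED;
		break;
	default:
		break;
	}
}

/* Stop path: halts only in ON, RESTARTED or WEDGED, as in the thread. */
static void model_stop(struct model_dev *d)
{
	if (d->state == MODEL_ON || d->state == MODEL_RESTARTED ||
	    d->state == MODEL_WEDGED)
		model_halt(d);
	d->state = MODEL_OFF;
}

/* Two back-to-back recovery triggers followed by a stop. */
int model_double_crash_halts(bool lazy)
{
	struct model_dev d = { .state = MODEL_ON, .halt_calls = 0 };

	model_restart_work(&d, lazy);
	model_restart_work(&d, lazy);
	model_stop(&d);
	return d.halt_calls;
}
```

In this model, model_double_crash_halts(false) counts two teardowns (the
double detach from the trace), while model_double_crash_halts(true) counts
one.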
