We have run into this bug before; please see bug 6584724. Note that the stack is a little different because of code changes made during the review.
Thanks
- Cathy

> During IPMP testing, I hit an interesting deadlock between softmac/GLDv3
> and ce.  Thread 1 grabbed di_lock as RW_WRITER (via dls_multicst_remove()),
> sent a DL_DISABMULTI_REQ downstream, and is blocked waiting for an ACK:
>
>   stack pointer for thread 2a10046fca0: 2a10046eda1
>   [ 000002a10046eda1 cv_timedwait+0x8c() ]
>     000002a10046ee51 softmac_output+0x80()
>     000002a10046ef01 mac_multicst_remove+0xc4()
>     000002a10046efb1 dls_multicst_remove+0x60()
>     000002a10046f061 proto_disabmulti_req+0xbc()
>     000002a10046f111 dld_wput_nondata_task+0xf0()
>     000002a10046f1c1 taskq_d_thread+0xbc()
>     000002a10046f291 thread_start+4()
>
> Thread 2 is an interrupt that happened to come in after thread 1 grabbed
> di_lock but before the DL_DISABMULTI_REQ was handled by ce.  Inside the
> ce_intr() logic, it grabbed a lock as RW_READER and called putnext().
> It's blocked in dls_accept() trying to acquire di_lock as RW_READER:
>
>   stack pointer for thread 2a10007fca0: 2a10007e191
>   [ 000002a10007e191 turnstile_block+0x5a4() ]
>     000002a10007e241 rw_enter_sleep+0x168()
>     000002a10007e2f1 dls_accept+0x1c()
>     000002a10007e3a1 i_dls_link_rx+0x260()
>     000002a10007e4d1 mac_do_rx+0xb0()
>     000002a10007e581 putnext+0x3f4()
>     000002a10007e631 ce_intr+0x1a8c()
>     000002a10007f1d1 pci_intr_wrapper+0xe8()
>     000002a10007f291 intr_thread+0x2b8()
>
> Thread 3 is the taskq handling the DL_DISABMULTI_REQ.  It's trying to
> acquire the aforementioned ce lock as RW_WRITER, but is blocked because
> thread 2 holds it as RW_READER:
>
>   stack pointer for thread 2a100157ca0: 2a100156691
>   [ 000002a100156691 turnstile_block+0x5a4() ]
>     000002a100156741 rw_enter_sleep+0x1b0()
>     000002a1001567f1 ce_dmreq+0xc8()
>     000002a1001568b1 ce_proto+0x1d8()
>     000002a100156961 ce_wsrv+0x2d30()
>     000002a100157061 runservice+0x6c()
>     000002a100157111 stream_service+0x190()
>     000002a1001571c1 taskq_d_thread+0xbc()
>     000002a100157291 thread_start+4()
>
> So, T1 is blocked waiting for T3, T3 is blocked waiting for T2, and T2 is
> blocked waiting for T1.  Seems like the right fix is to change ce not to
> hold a lock across putnext(), but that may be a high-risk change and there
> may be other legacy drivers that have a similar flaw.  So I'm interested
> to hear from Thiru on whether his new GLDv3 locking design would also
> resolve this deadlock.
