During IPMP testing, I hit an interesting deadlock between softmac/GLDv3
and ce. Thread 1 grabbed di_lock as RW_WRITER (via dls_multicst_remove()),
sent a DL_DISABMULTI_REQ downstream, and is now blocked waiting for the
ACK:

stack pointer for thread 2a10046fca0: 2a10046eda1
[ 000002a10046eda1 cv_timedwait+0x8c() ]
000002a10046ee51 softmac_output+0x80()
000002a10046ef01 mac_multicst_remove+0xc4()
000002a10046efb1 dls_multicst_remove+0x60()
000002a10046f061 proto_disabmulti_req+0xbc()
000002a10046f111 dld_wput_nondata_task+0xf0()
000002a10046f1c1 taskq_d_thread+0xbc()
000002a10046f291 thread_start+4()

Thread 2 is an interrupt thread that came in after thread 1 grabbed
di_lock but before ce handled the DL_DISABMULTI_REQ. Inside ce_intr(),
it grabbed a ce lock as RW_READER and called putnext(). It's now blocked
in dls_accept() trying to acquire di_lock as RW_READER:

stack pointer for thread 2a10007fca0: 2a10007e191
[ 000002a10007e191 turnstile_block+0x5a4() ]
000002a10007e241 rw_enter_sleep+0x168()
000002a10007e2f1 dls_accept+0x1c()
000002a10007e3a1 i_dls_link_rx+0x260()
000002a10007e4d1 mac_do_rx+0xb0()
000002a10007e581 putnext+0x3f4()
000002a10007e631 ce_intr+0x1a8c()
000002a10007f1d1 pci_intr_wrapper+0xe8()
000002a10007f291 intr_thread+0x2b8()

Thread 3 is the taskq thread handling the DL_DISABMULTI_REQ in ce. It's
trying to acquire that same ce lock as RW_WRITER, but is blocked because
thread 2 holds it as RW_READER:

stack pointer for thread 2a100157ca0: 2a100156691
[ 000002a100156691 turnstile_block+0x5a4() ]
000002a100156741 rw_enter_sleep+0x1b0()
000002a1001567f1 ce_dmreq+0xc8()
000002a1001568b1 ce_proto+0x1d8()
000002a100156961 ce_wsrv+0x2d30()
000002a100157061 runservice+0x6c()
000002a100157111 stream_service+0x190()
000002a1001571c1 taskq_d_thread+0xbc()
000002a100157291 thread_start+4()

So, T1 is blocked waiting for T3, T3 is blocked waiting for T2, and T2 is
blocked waiting for T1. The right fix seems to be changing ce so that it
doesn't hold a lock across putnext(), but that may be a high-risk change,
and other legacy drivers may have the same flaw. So I'm interested to
hear from Thiru on whether his new GLDv3 locking design would also
resolve this deadlock.
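
To make the cycle explicit, here is a minimal userland sketch of the
wait-for relationships between the three threads. It is only a model, not
kernel code: the table layout and the names in_wait_cycle(),
deadlock_waits_for and fixed_waits_for are made up for illustration. The
second table assumes ce drops its lock before calling putnext(), so that
T3 no longer has to wait on T2.

```c
#include <assert.h>

enum { T1, T2, T3, NTHREADS };

/*
 * waits_for[t] is the thread that t is blocked on, or -1 if t can run.
 * This models the observed hang.
 */
static const int deadlock_waits_for[NTHREADS] = {
	[T1] = T3,	/* holds di_lock (writer), waits for T3's ACK */
	[T2] = T1,	/* holds ce lock (reader), wants di_lock (reader) */
	[T3] = T2,	/* wants ce lock (writer) */
};

/*
 * Hypothetical state if ce released its lock before putnext(): T3 can
 * take the ce lock, so it is not blocked on anyone.
 */
static const int fixed_waits_for[NTHREADS] = {
	[T1] = T3,
	[T2] = T1,
	[T3] = -1,
};

/* Return 1 if following wait edges from 'start' leads back to 'start'. */
static int
in_wait_cycle(const int *waits_for, int nthreads, int start)
{
	int t = waits_for[start];
	int hops;

	for (hops = 0; hops < nthreads && t != -1; hops++) {
		if (t == start)
			return (1);
		t = waits_for[t];
	}
	return (0);
}
```

With the first table, in_wait_cycle() reports a cycle from any of the
three threads. With the second, T3 runs, the ACK reaches T1, T1 drops
di_lock, and T2 can then get through dls_accept().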
--
meem