We ran into this bug before; please see bug 6584724. Note that the stacks are
a little different because of code changes made during the review.

Thanks
- Cathy

> During IPMP testing, I hit an interesting deadlock between softmac/GLDv3
> and ce.  Thread 1 grabbed di_lock as RW_WRITER (via dls_multicst_remove()),
> sent a DL_DISABMULTI_REQ downstream, and is blocked waiting for an ACK:
> 
>   stack pointer for thread 2a10046fca0: 2a10046eda1
>   [ 000002a10046eda1 cv_timedwait+0x8c() ]
>     000002a10046ee51 softmac_output+0x80()
>     000002a10046ef01 mac_multicst_remove+0xc4()
>     000002a10046efb1 dls_multicst_remove+0x60()
>     000002a10046f061 proto_disabmulti_req+0xbc()
>     000002a10046f111 dld_wput_nondata_task+0xf0()
>     000002a10046f1c1 taskq_d_thread+0xbc()
>     000002a10046f291 thread_start+4()
> 
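
To make the shape of that path concrete, here is a rough sketch, using
hypothetical link_impl_t/li_* names rather than the real dls/softmac code,
of holding the rwlock as RW_WRITER while waiting for the driver's ack:

  #include <sys/types.h>
  #include <sys/ksynch.h>

  /* Hypothetical per-link state; the li_* names are illustrative only. */
  typedef struct link_impl {
          krwlock_t   li_lock;        /* plays the di_lock role */
          kmutex_t    li_ack_lock;
          kcondvar_t  li_ack_cv;      /* signalled when the driver acks */
          boolean_t   li_acked;
  } link_impl_t;

  static void
  li_multicst_remove(link_impl_t *lip)
  {
          rw_enter(&lip->li_lock, RW_WRITER);

          /* ... build and send the DL_DISABMULTI_REQ downstream ... */

          /*
           * Still holding li_lock as writer while waiting for the ack,
           * so every other user of li_lock queues up behind this thread.
           */
          mutex_enter(&lip->li_ack_lock);
          while (!lip->li_acked)
                  cv_wait(&lip->li_ack_cv, &lip->li_ack_lock);
          mutex_exit(&lip->li_ack_lock);

          rw_exit(&lip->li_lock);
  }
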
> Thread 2 is an interrupt that happened to come in after thread 1 grabbed
> di_lock but before the DL_DISABMULTI_REQ was handled by ce.  Inside the
> ce_intr() logic, it grabbed a lock as RW_READER and called putnext().
> It's blocked in dls_accept() trying to acquire di_lock as RW_READER:
> 
>   stack pointer for thread 2a10007fca0: 2a10007e191
>   [ 000002a10007e191 turnstile_block+0x5a4() ]
>     000002a10007e241 rw_enter_sleep+0x168()
>     000002a10007e2f1 dls_accept+0x1c()
>     000002a10007e3a1 i_dls_link_rx+0x260()
>     000002a10007e4d1 mac_do_rx+0xb0()
>     000002a10007e581 putnext+0x3f4()    
>     000002a10007e631 ce_intr+0x1a8c()
>     000002a10007f1d1 pci_intr_wrapper+0xe8()
>     000002a10007f291 intr_thread+0x2b8()
> 
> Thread 3 is the taskq handling the DL_DISABMULTI_REQ.  It's trying to
> acquire the aforementioned ce lock as RW_WRITER, but is blocked because
> thread 2 holds it as RW_READER:
> 
>   stack pointer for thread 2a100157ca0: 2a100156691
>   [ 000002a100156691 turnstile_block+0x5a4() ]
>     000002a100156741 rw_enter_sleep+0x1b0()
>     000002a1001567f1 ce_dmreq+0xc8()
>     000002a1001568b1 ce_proto+0x1d8()
>     000002a100156961 ce_wsrv+0x2d30()
>     000002a100157061 runservice+0x6c()
>     000002a100157111 stream_service+0x190()
>     000002a1001571c1 taskq_d_thread+0xbc()
>     000002a100157291 thread_start+4()
> 
> So, T1 is blocked waiting for T3, T3 is blocked waiting for T2, and T2 is
> blocked waiting for T1.  Seems like the right fix is to change ce not to
> hold a lock across putnext(), but that may be a high-risk change and there
> may be other legacy drivers that have a similar flaw.  So I'm interested
> to hear from Thiru on whether his new GLDv3 locking design would also
> resolve this deadlock.
> 
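
For what it's worth, here is a minimal sketch of the "don't hold a lock
across putnext()" approach suggested above, with hypothetical xx_t/xx_*
names rather than the actual ce code: the receive path snapshots what it
needs under the rwlock, drops the lock, and only then calls putnext(), so
an interrupt thread is never holding the driver lock while it blocks on an
upper-layer lock such as di_lock.

  #include <sys/types.h>
  #include <sys/stream.h>
  #include <sys/ksynch.h>

  /* Hypothetical per-instance state; the xx_* names are illustrative only. */
  typedef struct xx {
          krwlock_t   xx_lock;        /* protects xx_rq and multicast state */
          queue_t     *xx_rq;         /* read queue used to send packets up */
  } xx_t;

  static void
  xx_rx(xx_t *xxp, mblk_t *mp)
  {
          queue_t *rq;

          rw_enter(&xxp->xx_lock, RW_READER);
          rq = xxp->xx_rq;            /* snapshot under the lock */
          rw_exit(&xxp->xx_lock);     /* drop it before calling up */

          if (rq != NULL)
                  putnext(rq, mp);    /* may block on upper-layer locks */
          else
                  freemsg(mp);
  }

Of course, once the lock is dropped the driver needs some other way to keep
xx_rq valid across the putnext() (a reference or quiesce scheme), which is
presumably part of what makes this a high-risk change.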

