>-----Original Message----- >From: Mike Christie [mailto:[EMAIL PROTECTED] >Sent: Tuesday, August 19, 2008 3:24 AM > >Actually :)... forget that patch. Attached is a new one >fix-my-f-ups2.patch that fixes more possible bugs. I also stopped >abusing the ESB bits and defined internal state bits. >
The same bug of system crash after "WARNING: at net/ipv4/tcp_timer.c:290" occurred after applying fix-my-f-ups2.patch. I root caused this bug to passing up abort response to fcp while still freeing abort response frame as per older abort code in fc_exch_abts_resp() caused double frame free for eth driver allocated skb buffer and that likely messed up my networking stack. This explains ssh session hanged first and then system hanged after network skb pool messed up. Attached patch fixed that double frame free in fcp abort path and with this fix your 14 patches + fix-my-f-ups2.patch set works fine. I'm applying these patches to git tree. >The bugs: > >1. If fc_exch_rrq_resp is called because mgr reset is run, then we will >not drop the hold taken when the recov qual was found in the abts >response. The check for the qual in fc_exch_reset only handles if the >timer function has not sent the rrq yet. I added a ref to the aborted_ep >to the ep so we could make sure this gets cleaned up. > Where did you add ref for aborted_ep ? Is it in addition to previously set exch ref along with setting ESB_ST_REC_QUAL ? >2. If fc_exch_rrq's call to fc_exch_seq_send failed, then it looks like >we want to retry the rrq, so we need to reset the ESB_ST_REC_QUAL bit. > I couldn't understand this bug. Currently fc_exch_rrq's set this bit if fc_exch_seq_send fails and then exch timer to retry rrq. The fc_exch_timeout() or fc_exch_reset() will reset this bit again. However I do see issues in resetting this bit before call to fc_exch_rrq and possible race with fc_exch_reset(), I guess this can be avoided using exch lock through out fc_exch_rrq and reset this bit when rrq is done. >3. If the timer function is sending the rrq, the ep for the rrq could >already be on the ex_list. If fc_exch_reset runs and sees it on the >list, then it would call fc_exch_done_locked (releasing the hold from >the timer functions alloc) on the ep while the timer function was still >accessing the ep. fc_exch_reset could then exit and we would release >fc_exch_mgr_reset's hold which would free the ep while the timer >function was still using it. > The exch lock throughout fc_exch_rrq will help here. >We really want to make sure the ep that we are sending the rrq for is >cleaned up first and that we use del_timer_sync to make sure the timer >fucntion is done accessing the rrq's ep. We cannot use del_timer_sync >because the lport lock is held in some cases. Do we need to have that >lock held? > The local port lock is also not good given exch resp gets called from reset which also tries to grap local port lock. So guess we have to ensure no local port lock when calling fc_exch_mgr_reset. Rob can elaborate more on local port locking policy. >I tried to close some of the races by setting FC_EX_RST_CLEANUP and then >checking for that in fc_exch_timeout and fc_exch_timer_set_locked so >would not at least start a new timer referencing bad memory. > >It seems like we may need a more generic fix to this problem where the >fc_exch_mgr_reset can free a ep while someone is sending a sequence for >it. For fc_fcp.c we are safe because fc_fcp.c holds the scsi pkt lock >when we send IO and in the resp handler, but I am not sure about other >places. Are there more like with the rrq path? I think if we use policy of using upper layer locks in send and receive (ep->resp) via fc_exch.c but no upper layer lock in fc_exch_mgr_reset() then that should work well with fc_exch_reset() for race issues with fc_exch_reset(). I think we have been doing this mostly.
_______________________________________________ devel mailing list [email protected] http://www.open-fcoe.org/mailman/listinfo/devel
