Mike Christie wrote: > Mike Christie wrote: >> Dev, Vasu wrote: >>>> The locking and refcount use is really weird now though. Would it be >>>> cleaner or more standard to just say that you need to always grab the >>>> lock when accessing the ex_refcount instead of sometimes doing it and >>>> sometimes not (too confusing). >>> Currently exch ref count can be changed(inc/dec) without having need to hold >>> exch lock, but this can be revised as suggested above to always have exch >>> lock for changing exch ref. However as you mentioned in other email that >>> exch ref and locking is still nicer than fsp pkt locking and doesn't need >>> expensive del_timer_sync() as used in FCP. >> I am not sure what you mean by combining the two issues? For the FCP >> path we do not need the pkt lock for the actual release or hold of the >> refcount. We actually nornally never have the pkt lock when we do the >> final release, because fc_fcp_unlock_pkt drops the lock then drops the >> refcount. >> >> My comment about the del timer and refcounts is just about not needing >> to use the del_timer_sync with the exahnge code because the timer >> operation should always have a refcount. >> >> My comment about taking the lock around the exhange hold/release is just >> about what the locks and refcounts are protecting? I mean it seems like >> we are trying to do the two operations: >> >> exchange mp->exches lookup + hold >> and >> mp->exches clear + release >> >> atomically. We currently do both the lookup and hold under the >> mp->em_lock, but we do the release without the lock and take it only if >> the refcount is zero so we can clear the mp->exches array among other >> things. And this is racey because it seems like we want do drop the >> refcount and clear the mp->exches atomically. >> >> So with your patch we now drop the count without the lock, but if zero >> grab the lock to check the refcont again which seems weird because >> normally when a refcount goes to zero it means that we are going to free >> it. I mean if we were strict and we used krefs this would not be >> possible, and if you tried to send a patch for kref_put to do an extra >> check it is easier to imagine Christoph telling us whatever we are >> trying to do is crap over Greg happily taking it. >> > > After I thought this over the night, I think both solutions are equally > ugly :) so maybe there is something more fundamentally wrong with what > we are doing. > > I am trying to fix the abort handling and I think that interacts with > this. Right now we can go from fc_fcp_timeout to fc_timeout_error which > calls seq_exch_abort then calls fc_io_compl and that calls scsi_done to > return the command to the scsi layer. The problem with this is that > after we have sent the abort we do not know the status of it. If it is a > tape command scsi ml is not going to retry, but we really would like to > tell st.c what is up with the command (if it got aborted or not). And > for disk commands, if scsi-ml decides to kill it then we want to make > sure the device is still not writing out data, when it is returned to > the FS. > > Also for the fc_fcp_timeout->fc_timeout_error path, if the seq_ptr is > set we never call exch_done to clean up the response handler. > fc_io_compl then does the release on the fsp from the initial allocation > so if we ever do get a response the ep->resp is going to be accessing a > freed (or possible reallocated to some other command) fsp. > > Or if we have sent the abort, and then fc_fcp_recv gets a response for > the scsi command that indicates it is done, then we call exch_done and > if fc_exch_release frees the ep before the abort response comes in > fc_exch.c does not get the abort response (which is not needed today > because fc_eh_abort's response code does not exist and always timesout) > but will be needed below), so any waiters are left timing out. > fc_fcp_complete could actually check for this too scenario. > > So I think for the abort code we want to do a hold on the exchange if it > is sent, and fc_fcp.c wants to get a response. In fc_exch.c we want to > do a hold on the ep in exch_abort. Then in the fc_fcp.c abort response > handler, we can do a exch_done for the abort to release the > seq_send_abort's hold. If the abort completely cleaned up the exchange > fc_fcp.c could do another exch_done to release the hold that was done > for the scsi command. > > exch_done would then need to be modified so that instead of > fc_exch_release doing > > spin_lock_bh(&mp->em_lock); > WARN_ON(mp->total_exches <= 0); > mp->total_exches--; > mp->exches[ep->xid - mp->min_xid] = NULL; > list_del(&ep->ex_list); > spin_unlock_bh(&mp->em_lock); > > exch_done would do that, then release its hold. And actually exch_done > would need to check that if there were multiple ep->resp functions (one > from fc_fcp_send_cmd and one from the abort) then we do not do the mp > manipulation above until the exch_done has been called for the last > ep->resp. > > There are more corner cases like with all fun error handler, but I guess > the main idea is that we would be moving the mp manipulation to > exch_done instead of doing it in the release function. And exch_done > would do the release after the mp manipulation, so we would not hit that > race. The abort has a hold on the ep which is released by the abort > caller when it calls exch_done. fc_fcp.c gets its abort responses and > the fc_timeout_error path does not leave dangling ep->resps. > > What do you think?
Oh yeah, my comments on this patch can be ignored for the merge of the patch since it is almost al timer clean ups in here. I will send a patch over your two patches. _______________________________________________ devel mailing list [email protected] http://www.open-fcoe.org/mailman/listinfo/devel
