Re: [Open-FCoE] [RFC PATCH 1/2] libfc: removed checks for timer_pending and then updating exch ref count.

Mike Christie Sun, 10 Aug 2008 18:51:32 -0700

Mike Christie wrote:
> Mike Christie wrote:
>> Dev, Vasu wrote:
>>>> The locking and refcount use is really weird now though. Would it be
>>>> cleaner or more standard to just say that you need to always grab the
>>>> lock when accessing the ex_refcount instead of sometimes doing it and
>>>> sometimes not (too confusing).
>>> Currently exch ref count can be changed(inc/dec) without having need to hold
>>> exch lock, but this can be revised as suggested above to always have exch
>>> lock for changing exch ref. However as you mentioned in other email that
>>> exch ref and locking is still nicer than fsp pkt locking and doesn't need
>>> expensive del_timer_sync() as used in FCP.
>> I am not sure what you mean by combining the two issues? For the FCP 
>> path we do not need the pkt lock for the actual release or hold of the 
>> refcount. We actually nornally never have the pkt lock when we do the 
>> final release, because fc_fcp_unlock_pkt drops the lock then drops the 
>> refcount.
>>
>> My comment about the del timer and refcounts is just about not needing 
>> to use the del_timer_sync with the exahnge code because the timer 
>> operation should always have a refcount.
>>
>> My comment about taking the lock around the exhange hold/release is just 
>> about what the locks and refcounts are protecting? I mean it seems like 
>> we are trying to do the two operations:
>>
>> exchange mp->exches lookup + hold
>> and
>> mp->exches clear + release
>>
>> atomically. We currently do both the lookup and hold under the 
>> mp->em_lock, but we do the release without the lock and take it only if 
>> the refcount is zero so we can clear the mp->exches array among other 
>> things. And this is racey because it seems like we want do drop the 
>> refcount and clear the mp->exches atomically.
>>
>> So with your patch we now drop the count without the lock, but if zero 
>> grab the lock to check the refcont again which seems weird because 
>> normally when a refcount goes to zero it means that we are going to free 
>> it. I mean if we were strict and we used krefs this would not be 
>> possible, and if you tried to send a patch for kref_put to do an extra 
>> check it is easier to imagine Christoph telling us whatever we are 
>> trying to do is crap over Greg happily taking it.
>>
> 
> After I thought this over the night, I think both solutions are equally 
> ugly :) so maybe there is something more fundamentally wrong with what 
> we are doing.
> 
> I am trying to fix the abort handling and I think that interacts with 
> this. Right now we can go from fc_fcp_timeout to fc_timeout_error which 
> calls seq_exch_abort then calls fc_io_compl and that calls scsi_done to 
> return the command to the scsi layer. The problem with this is that 
> after we have sent the abort we do not know the status of it. If it is a 
> tape command scsi ml is not going to retry, but we really would like to 
> tell st.c what is up with the command (if it got aborted or not). And 
> for disk commands, if scsi-ml decides to kill it then we want to make 
> sure the device is still not writing out data, when it is returned to 
> the FS.
> 
> Also for the fc_fcp_timeout->fc_timeout_error path, if the seq_ptr is 
> set we never call exch_done to clean up the response handler. 
> fc_io_compl then does the release on the fsp from the initial allocation 
> so if we ever do get a response the ep->resp is going to be accessing a 
> freed (or possible reallocated to some other command) fsp.
> 
> Or if we have sent the abort, and then fc_fcp_recv gets a response for 
> the scsi command that indicates it is done, then we call exch_done and 
> if fc_exch_release frees the ep before the abort response comes in 
> fc_exch.c does not get the abort response (which is not needed today 
> because fc_eh_abort's response code does not exist and always timesout) 
> but will be needed below), so any waiters are left timing out. 
> fc_fcp_complete could actually check for this too scenario.
> 
> So I think for the abort code we want to do a hold on the exchange if it 
> is sent, and fc_fcp.c wants to get a response. In fc_exch.c we want to 
> do a hold on the ep in exch_abort. Then in the fc_fcp.c abort response 
> handler, we can do a exch_done for the abort to release the 
> seq_send_abort's hold. If the abort completely cleaned up the exchange 
> fc_fcp.c could do another exch_done to release the hold that was done 
> for the scsi command.
> 
> exch_done would then need to be modified so that instead of 
> fc_exch_release doing
> 
>                  spin_lock_bh(&mp->em_lock);
>                  WARN_ON(mp->total_exches <= 0);
>                  mp->total_exches--;
>                  mp->exches[ep->xid - mp->min_xid] = NULL;
>                  list_del(&ep->ex_list);
>                  spin_unlock_bh(&mp->em_lock);
> 
> exch_done would do that, then release its hold. And actually exch_done 
> would need to check that if there were multiple ep->resp functions (one 
> from fc_fcp_send_cmd and one from the abort) then we do not do the mp 
> manipulation above until the exch_done has been called for the last 
> ep->resp.
> 
> There are more corner cases like with all fun error handler, but I guess 
> the main idea is that we would be moving the mp manipulation to 
> exch_done instead of doing it in the release function. And exch_done 
> would do the release after the mp manipulation, so we would not hit that 
> race. The abort has a hold on the ep which is released by the abort 
> caller when it calls exch_done. fc_fcp.c gets its abort responses and 
> the fc_timeout_error path does not leave dangling ep->resps.
> 
> What do you think?


Oh yeah, my comments on this patch can be ignored for the merge of the 
patch since it is almost al timer clean ups in here. I will send a patch 
over your two patches.
_______________________________________________
devel mailing list
[email protected]
http://www.open-fcoe.org/mailman/listinfo/devel

Re: [Open-FCoE] [RFC PATCH 1/2] libfc: removed checks for timer_pending and then updating exch ref count.

Reply via email to