Vasu Dev wrote:
> On Tue, 2009-10-13 at 14:48 -0700, Joe Eykholt wrote:
>> Vasu Dev wrote:
>>> On Mon, 2009-10-12 at 11:39 -0700, Joe Eykholt wrote:
>>>> This is with the patch I just submitted on top of fcoe-next.
>>>> I believe it's unrelated to my patch.
>>>>
>>>> On removing the module or deleting an instance (I'm not sure which), the
>>>> system crashes.  I think the problem is that fc_exch_release() happens
>>>> too late, after the exchange manager is freed, so that the second arg
>>>> to mempool_free, the pool pointer, is 6b6b6b6b6b6b6b6b, which is the
>>>> slab allocator's free poison value.
>>>>
>>>> Symptoms with other allocators will probably be different, but I find the
>>>> slab allocator with CONFIG_DEBUG_SLAB handy for finding things like this.
>>>>
>>>> We need to do a sync cancel somewhere before removing the exchange manager.
>>> The fc_exch_reset needs to call cancel_delayed_work_sync() instead
>>> cancel_delayed_work, we could not call _sync here since the
>>> fc_exch_reset is called with lport lock held and calling _sync would
>>> have acquired lport lock again causing deadlock. Currently lport exch
>>> resp handler checks for -FC_EX_CLOSED without acquiring lport lock, so
>>> now it should be safe to call cancel_delayed_work_sync in
>>> fc_exch_reset(), let me try this fix.
>> The lock checking code still will not like it.
>> The checker would see that the work item (response handler)
>> grabs the lp_mutex in some cases for FC_EX_TIMEOUT, so it wouldn't
>> like us holding lp_mutex during a cancel_sync.
> 
> We won't hit this case due FC_EX_RST_CLEANUP flag when timeout handler
> is called from _sync context from fc_exch_reset. The timeout handler
> already in progress would be in their own worker context, so this
> shouldn't be problem here.

I see.  What about the lock checking code, though?  It might still
print a warning about that.  Maybe there's a way of quieting it.

>> That's a real problem since the timeout handler may be trying to get
>> the lock and we would wait for the handler to finish.
>>
> 
> As long as upper blocks of EM doesn't tries to acquire their lock on
> EX_CLOSED event we are okay and  current code seems to not doing this
> anymore.
> 
>> Another way might be to use a separate work queue for exchange timeouts,
>> and flush all work on that queue synchronously before freeing an
>> exchange manager.
>>
> 
> I tried that fix before and posted here for RFC
> http://www.open-fcoe.org/pipermail/devel/2009-August/003743.html 
> 
> Yeah this should also work fine. I think my test code had issue in
> testing above RFC patch as I was skipping cancel_delayed_work() in my
> test while expecting flush_workqueue to clean up delayed scheduled work
> still sitting in timer queue. Let me know if above fix works for you. I
> also done with changes for fixing using _sync.
> 
> I think fix 1 is better in above RFC patch since other fix with _sync
> adds more constraint on upper block to not acquire lock again and
> required dropping _bh in acquiring exch lock in fc_exch_reset since
> _sync might sleep. 

Not only that, but the ex_lock *is* taken by fc_exch_timeout() and
is held while we're calling cancel_delayed_work_sync(), so that's
an issue.

> Which one do you think is better fix ?

I agree the RFC fix is better.

> Fix with _sync is pasted below, I'll test this more and review upper
> block code for acquiring lport lock on _sync call.
>  
> diff --git a/drivers/scsi/libfc/fc_exch.c b/drivers/scsi/libfc/fc_exch.c
> index 6e4adb8..8db9a00 100644
> --- a/drivers/scsi/libfc/fc_exch.c
> +++ b/drivers/scsi/libfc/fc_exch.c
> @@ -1614,21 +1614,14 @@ static void fc_seq_ls_rjt(struct fc_seq *req_sp,
> enum fc_els_rjt_reason reason,
>  static void fc_exch_reset(struct fc_exch *ep)
>  {
>         struct fc_seq *sp;
> -       void (*resp)(struct fc_seq *, struct fc_frame *, void *);
> +       void (*resp)(struct fc_seq *, struct fc_frame *, void *) = NULL;
>         void *arg;
>         int rc = 1;
> 
> -       spin_lock_bh(&ep->ex_lock);
> +       spin_lock(&ep->ex_lock);
>         ep->state |= FC_EX_RST_CLEANUP;
> -       /*
> -        * we really want to call del_timer_sync, but cannot due
> -        * to the lport calling with the lport lock held (some resp
> -        * functions can also grab the lport lock which could cause
> -        * a deadlock).
> -        */
> -       if (cancel_delayed_work(&ep->timeout_work))
> -               atomic_dec(&ep->ex_refcnt);     /* drop hold for timer
> */
> -       resp = ep->resp;
> +       if (cancel_delayed_work_sync(&ep->timeout_work))
> +               resp = ep->resp;
>         ep->resp = NULL;
>         if (ep->esb_stat & ESB_ST_REC_QUAL)
>                 atomic_dec(&ep->ex_refcnt);     /* drop hold for
> rec_qual */
> @@ -1636,7 +1629,7 @@ static void fc_exch_reset(struct fc_exch *ep)
>         arg = ep->arg;
>         sp = &ep->seq;
>         rc = fc_exch_done_locked(ep);
> -       spin_unlock_bh(&ep->ex_lock);
> +       spin_unlock(&ep->ex_lock);
>         if (!rc)
>                 fc_exch_delete(ep);
> 
> 
>> <omitting log>
>>
>>      Regards,
>>      Joe
> 

_______________________________________________
devel mailing list
[email protected]
http://www.open-fcoe.org/mailman/listinfo/devel

Reply via email to