On Tue, 2009-10-13 at 14:48 -0700, Joe Eykholt wrote:
> Vasu Dev wrote:
> > On Mon, 2009-10-12 at 11:39 -0700, Joe Eykholt wrote:
> >> This is with the patch I just submitted on top of fcoe-next.
> >> I believe it's unrelated to my patch.
> >>
> >> On removing the module or deleting an instance (I'm not sure which), the
> >> system crashes.  I think the problem is that fc_exch_release() happens
> >> too late, after the exchange manager is freed, so that the second arg
> >> to mempool_free, the pool pointer, is 6b6b6b6b6b6b6b6b, which is the
> >> slab allocator's free poison value.
> >>
> >> Symptoms with other allocators will probably be different, but I find the
> >> slab allocator with CONFIG_DEBUG_SLAB handy for finding things like this.
> >>
> >> We need to do a sync cancel somewhere before removing the exchange manager.
> > 
> > The fc_exch_reset needs to call cancel_delayed_work_sync() instead
> > cancel_delayed_work, we could not call _sync here since the
> > fc_exch_reset is called with lport lock held and calling _sync would
> > have acquired lport lock again causing deadlock. Currently lport exch
> > resp handler checks for -FC_EX_CLOSED without acquiring lport lock, so
> > now it should be safe to call cancel_delayed_work_sync in
> > fc_exch_reset(), let me try this fix.
> 
> The lock checking code still will not like it.
> The checker would see that the work item (response handler)
> grabs the lp_mutex in some cases for FC_EX_TIMEOUT, so it wouldn't
> like us holding lp_mutex during a cancel_sync.

We won't hit this case due FC_EX_RST_CLEANUP flag when timeout handler
is called from _sync context from fc_exch_reset. The timeout handler
already in progress would be in their own worker context, so this
shouldn't be problem here.

> 
> That's a real problem since the timeout handler may be trying to get
> the lock and we would wait for the handler to finish.
> 

As long as upper blocks of EM doesn't tries to acquire their lock on
EX_CLOSED event we are okay and  current code seems to not doing this
anymore.

> Another way might be to use a separate work queue for exchange timeouts,
> and flush all work on that queue synchronously before freeing an
> exchange manager.
> 

I tried that fix before and posted here for RFC
http://www.open-fcoe.org/pipermail/devel/2009-August/003743.html 

Yeah this should also work fine. I think my test code had issue in
testing above RFC patch as I was skipping cancel_delayed_work() in my
test while expecting flush_workqueue to clean up delayed scheduled work
still sitting in timer queue. Let me know if above fix works for you. I
also done with changes for fixing using _sync.

I think fix 1 is better in above RFC patch since other fix with _sync
adds more constraint on upper block to not acquire lock again and
required dropping _bh in acquiring exch lock in fc_exch_reset since
_sync might sleep. 

Which one do you think is better fix ?

Fix with _sync is pasted below, I'll test this more and review upper
block code for acquiring lport lock on _sync call.
 
diff --git a/drivers/scsi/libfc/fc_exch.c b/drivers/scsi/libfc/fc_exch.c
index 6e4adb8..8db9a00 100644
--- a/drivers/scsi/libfc/fc_exch.c
+++ b/drivers/scsi/libfc/fc_exch.c
@@ -1614,21 +1614,14 @@ static void fc_seq_ls_rjt(struct fc_seq *req_sp,
enum fc_els_rjt_reason reason,
 static void fc_exch_reset(struct fc_exch *ep)
 {
        struct fc_seq *sp;
-       void (*resp)(struct fc_seq *, struct fc_frame *, void *);
+       void (*resp)(struct fc_seq *, struct fc_frame *, void *) = NULL;
        void *arg;
        int rc = 1;

-       spin_lock_bh(&ep->ex_lock);
+       spin_lock(&ep->ex_lock);
        ep->state |= FC_EX_RST_CLEANUP;
-       /*
-        * we really want to call del_timer_sync, but cannot due
-        * to the lport calling with the lport lock held (some resp
-        * functions can also grab the lport lock which could cause
-        * a deadlock).
-        */
-       if (cancel_delayed_work(&ep->timeout_work))
-               atomic_dec(&ep->ex_refcnt);     /* drop hold for timer
*/
-       resp = ep->resp;
+       if (cancel_delayed_work_sync(&ep->timeout_work))
+               resp = ep->resp;
        ep->resp = NULL;
        if (ep->esb_stat & ESB_ST_REC_QUAL)
                atomic_dec(&ep->ex_refcnt);     /* drop hold for
rec_qual */
@@ -1636,7 +1629,7 @@ static void fc_exch_reset(struct fc_exch *ep)
        arg = ep->arg;
        sp = &ep->seq;
        rc = fc_exch_done_locked(ep);
-       spin_unlock_bh(&ep->ex_lock);
+       spin_unlock(&ep->ex_lock);
        if (!rc)
                fc_exch_delete(ep);


> <omitting log>
> 
>       Regards,
>       Joe

_______________________________________________
devel mailing list
[email protected]
http://www.open-fcoe.org/mailman/listinfo/devel

Reply via email to