On Tue, 2009-10-13 at 14:48 -0700, Joe Eykholt wrote: > Vasu Dev wrote: > > On Mon, 2009-10-12 at 11:39 -0700, Joe Eykholt wrote: > >> This is with the patch I just submitted on top of fcoe-next. > >> I believe it's unrelated to my patch. > >> > >> On removing the module or deleting an instance (I'm not sure which), the > >> system crashes. I think the problem is that fc_exch_release() happens > >> too late, after the exchange manager is freed, so that the second arg > >> to mempool_free, the pool pointer, is 6b6b6b6b6b6b6b6b, which is the > >> slab allocator's free poison value. > >> > >> Symptoms with other allocators will probably be different, but I find the > >> slab allocator with CONFIG_DEBUG_SLAB handy for finding things like this. > >> > >> We need to do a sync cancel somewhere before removing the exchange manager. > > > > The fc_exch_reset needs to call cancel_delayed_work_sync() instead > > cancel_delayed_work, we could not call _sync here since the > > fc_exch_reset is called with lport lock held and calling _sync would > > have acquired lport lock again causing deadlock. Currently lport exch > > resp handler checks for -FC_EX_CLOSED without acquiring lport lock, so > > now it should be safe to call cancel_delayed_work_sync in > > fc_exch_reset(), let me try this fix. > > The lock checking code still will not like it. > The checker would see that the work item (response handler) > grabs the lp_mutex in some cases for FC_EX_TIMEOUT, so it wouldn't > like us holding lp_mutex during a cancel_sync.
We won't hit this case due FC_EX_RST_CLEANUP flag when timeout handler is called from _sync context from fc_exch_reset. The timeout handler already in progress would be in their own worker context, so this shouldn't be problem here. > > That's a real problem since the timeout handler may be trying to get > the lock and we would wait for the handler to finish. > As long as upper blocks of EM doesn't tries to acquire their lock on EX_CLOSED event we are okay and current code seems to not doing this anymore. > Another way might be to use a separate work queue for exchange timeouts, > and flush all work on that queue synchronously before freeing an > exchange manager. > I tried that fix before and posted here for RFC http://www.open-fcoe.org/pipermail/devel/2009-August/003743.html Yeah this should also work fine. I think my test code had issue in testing above RFC patch as I was skipping cancel_delayed_work() in my test while expecting flush_workqueue to clean up delayed scheduled work still sitting in timer queue. Let me know if above fix works for you. I also done with changes for fixing using _sync. I think fix 1 is better in above RFC patch since other fix with _sync adds more constraint on upper block to not acquire lock again and required dropping _bh in acquiring exch lock in fc_exch_reset since _sync might sleep. Which one do you think is better fix ? Fix with _sync is pasted below, I'll test this more and review upper block code for acquiring lport lock on _sync call. diff --git a/drivers/scsi/libfc/fc_exch.c b/drivers/scsi/libfc/fc_exch.c index 6e4adb8..8db9a00 100644 --- a/drivers/scsi/libfc/fc_exch.c +++ b/drivers/scsi/libfc/fc_exch.c @@ -1614,21 +1614,14 @@ static void fc_seq_ls_rjt(struct fc_seq *req_sp, enum fc_els_rjt_reason reason, static void fc_exch_reset(struct fc_exch *ep) { struct fc_seq *sp; - void (*resp)(struct fc_seq *, struct fc_frame *, void *); + void (*resp)(struct fc_seq *, struct fc_frame *, void *) = NULL; void *arg; int rc = 1; - spin_lock_bh(&ep->ex_lock); + spin_lock(&ep->ex_lock); ep->state |= FC_EX_RST_CLEANUP; - /* - * we really want to call del_timer_sync, but cannot due - * to the lport calling with the lport lock held (some resp - * functions can also grab the lport lock which could cause - * a deadlock). - */ - if (cancel_delayed_work(&ep->timeout_work)) - atomic_dec(&ep->ex_refcnt); /* drop hold for timer */ - resp = ep->resp; + if (cancel_delayed_work_sync(&ep->timeout_work)) + resp = ep->resp; ep->resp = NULL; if (ep->esb_stat & ESB_ST_REC_QUAL) atomic_dec(&ep->ex_refcnt); /* drop hold for rec_qual */ @@ -1636,7 +1629,7 @@ static void fc_exch_reset(struct fc_exch *ep) arg = ep->arg; sp = &ep->seq; rc = fc_exch_done_locked(ep); - spin_unlock_bh(&ep->ex_lock); + spin_unlock(&ep->ex_lock); if (!rc) fc_exch_delete(ep); > <omitting log> > > Regards, > Joe _______________________________________________ devel mailing list [email protected] http://www.open-fcoe.org/mailman/listinfo/devel
