Vasu Dev wrote:
> On Tue, 2009-10-13 at 14:48 -0700, Joe Eykholt wrote:
>> Vasu Dev wrote:
>>> On Mon, 2009-10-12 at 11:39 -0700, Joe Eykholt wrote:
>>>> This is with the patch I just submitted on top of fcoe-next.
>>>> I believe it's unrelated to my patch.
>>>>
>>>> On removing the module or deleting an instance (I'm not sure which), the
>>>> system crashes.  I think the problem is that fc_exch_release() happens
>>>> too late, after the exchange manager is freed, so that the second arg
>>>> to mempool_free, the pool pointer, is 6b6b6b6b6b6b6b6b, which is the
>>>> slab allocator's free poison value.
>>>>
>>>> Symptoms with other allocators will probably be different, but I find the
>>>> slab allocator with CONFIG_DEBUG_SLAB handy for finding things like this.
>>>>
>>>> We need to do a sync cancel somewhere before removing the exchange manager.
>>>
>>> fc_exch_reset() needs to call cancel_delayed_work_sync() instead of
>>> cancel_delayed_work().  We could not call _sync here before because
>>> fc_exch_reset() is called with the lport lock held, and calling _sync
>>> would have acquired the lport lock again, causing a deadlock.
>>> Currently the lport exch resp handler checks for -FC_EX_CLOSED without
>>> acquiring the lport lock, so now it should be safe to call
>>> cancel_delayed_work_sync() in fc_exch_reset(); let me try this fix.
>>
>> The lock checking code still will not like it.
>> The checker would see that the work item (response handler)
>> grabs the lp_mutex in some cases for FC_EX_TIMEOUT, so it wouldn't
>> like us holding lp_mutex during a cancel_sync.
>
> We won't hit this case, due to the FC_EX_RST_CLEANUP flag, when the
> timeout handler is called from the _sync context in fc_exch_reset().  A
> timeout handler already in progress would be in its own worker context,
> so this shouldn't be a problem here.
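
[Editor's note: a minimal sketch of the lock/work dependency being discussed
above, using hypothetical names (demo_mutex stands in for lp_mutex, demo_timeout
for the exchange timeout handler).  This is illustrative only, not libfc code.]

    /*
     * The handler takes a mutex; teardown holds that same mutex across
     * cancel_delayed_work_sync().  That is the pattern the lock checker
     * objects to.
     */
    #include <linux/mutex.h>
    #include <linux/workqueue.h>

    static DEFINE_MUTEX(demo_mutex);        /* stands in for lp_mutex */
    static struct delayed_work demo_work;

    static void demo_timeout(struct work_struct *work)
    {
            mutex_lock(&demo_mutex);        /* handler needs the lock */
            /* ... timeout/error processing ... */
            mutex_unlock(&demo_mutex);
    }

    static void demo_setup(void)
    {
            INIT_DELAYED_WORK(&demo_work, demo_timeout);
            schedule_delayed_work(&demo_work, 2 * HZ);
    }

    static void demo_reset(void)
    {
            mutex_lock(&demo_mutex);
            /*
             * If demo_timeout() is already running and blocked on
             * demo_mutex, this never returns: _sync waits for the handler
             * and the handler waits for the mutex we hold.  The checker
             * reports the dependency even when the runtime ordering
             * happens to avoid the deadlock.
             */
            cancel_delayed_work_sync(&demo_work);
            mutex_unlock(&demo_mutex);
    }
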
I see.  What about the lock checking code, though?  It might still print
a warning about that.  Maybe there's a way of quieting it.

>> That's a real problem since the timeout handler may be trying to get
>> the lock and we would wait for the handler to finish.
>
> As long as the upper blocks above the EM don't try to acquire their lock
> on the EX_CLOSED event we are okay, and the current code doesn't seem to
> do this anymore.
>
>> Another way might be to use a separate work queue for exchange timeouts,
>> and flush all work on that queue synchronously before freeing an
>> exchange manager.
>
> I tried that fix before and posted it here as an RFC:
> http://www.open-fcoe.org/pipermail/devel/2009-August/003743.html
>
> Yeah, this should also work fine.  I think my test code had an issue in
> testing the above RFC patch, as I was skipping cancel_delayed_work() in
> my test while expecting flush_workqueue() to clean up delayed work still
> sitting in the timer queue.  Let me know if the above fix works for you.
> I am also done with the changes for the fix using _sync.
>
> I think fix 1 in the above RFC patch is better, since the other fix with
> _sync puts more of a constraint on the upper block not to acquire the
> lock again, and it required dropping _bh when acquiring the exch lock in
> fc_exch_reset() since _sync might sleep.

Not only that, but the ex_lock *is* taken by fc_exch_timeout() and is
held while we're calling cancel_delayed_work_sync(), so that's an issue.

> Which one do you think is the better fix?

I agree the RFC fix is better.

> The fix with _sync is pasted below.  I'll test this more and review the
> upper block code for acquiring the lport lock on the _sync call.
>
> diff --git a/drivers/scsi/libfc/fc_exch.c b/drivers/scsi/libfc/fc_exch.c
> index 6e4adb8..8db9a00 100644
> --- a/drivers/scsi/libfc/fc_exch.c
> +++ b/drivers/scsi/libfc/fc_exch.c
> @@ -1614,21 +1614,14 @@ static void fc_seq_ls_rjt(struct fc_seq *req_sp, enum fc_els_rjt_reason reason,
>  static void fc_exch_reset(struct fc_exch *ep)
>  {
>  	struct fc_seq *sp;
> -	void (*resp)(struct fc_seq *, struct fc_frame *, void *);
> +	void (*resp)(struct fc_seq *, struct fc_frame *, void *) = NULL;
>  	void *arg;
>  	int rc = 1;
>  
> -	spin_lock_bh(&ep->ex_lock);
> +	spin_lock(&ep->ex_lock);
>  	ep->state |= FC_EX_RST_CLEANUP;
> -	/*
> -	 * we really want to call del_timer_sync, but cannot due
> -	 * to the lport calling with the lport lock held (some resp
> -	 * functions can also grab the lport lock which could cause
> -	 * a deadlock).
> -	 */
> -	if (cancel_delayed_work(&ep->timeout_work))
> -		atomic_dec(&ep->ex_refcnt);	/* drop hold for timer */
> -	resp = ep->resp;
> +	if (cancel_delayed_work_sync(&ep->timeout_work))
> +		resp = ep->resp;
>  	ep->resp = NULL;
>  	if (ep->esb_stat & ESB_ST_REC_QUAL)
>  		atomic_dec(&ep->ex_refcnt);	/* drop hold for rec_qual */
> @@ -1636,7 +1629,7 @@ static void fc_exch_reset(struct fc_exch *ep)
>  	arg = ep->arg;
>  	sp = &ep->seq;
>  	rc = fc_exch_done_locked(ep);
> -	spin_unlock_bh(&ep->ex_lock);
> +	spin_unlock(&ep->ex_lock);
>  	if (!rc)
>  		fc_exch_delete(ep);
>

>> <omitting log>
>>
>> Regards,
>> Joe
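
[Editor's note: a rough sketch of the shape of fix 1 from the RFC, a dedicated
timeout workqueue flushed before the exchange manager is freed.  The names
(exch_timeout_wq, demo_*) and structure are assumptions for illustration, not
the actual RFC patch.]

    #include <linux/errno.h>
    #include <linux/workqueue.h>

    static struct workqueue_struct *exch_timeout_wq;

    static int demo_init(void)
    {
            exch_timeout_wq = create_singlethread_workqueue("fc_exch_tmo");
            return exch_timeout_wq ? 0 : -ENOMEM;
    }

    /* Timeouts go on the dedicated queue instead of the shared one. */
    static void demo_arm_timeout(struct delayed_work *timeout_work,
                                 unsigned long timeout_jiffies)
    {
            queue_delayed_work(exch_timeout_wq, timeout_work, timeout_jiffies);
    }

    /* Called with no exchange or lport lock held. */
    static void demo_free_exch_mgr(struct delayed_work *timeout_work)
    {
            /*
             * Cancel first: flush_workqueue() only waits for work already
             * queued to the workqueue, not delayed work still sitting in
             * the timer wheel (the issue hit in the earlier test).  The
             * real code would also drop the reference held for the timer
             * when the cancel succeeds.
             */
            cancel_delayed_work(timeout_work);

            /* Then wait for any timeout handler that is already running. */
            flush_workqueue(exch_timeout_wq);

            /* Now it is safe to free the exchange manager and its mempool. */
    }

    static void demo_exit(void)
    {
            destroy_workqueue(exch_timeout_wq);
    }

The key ordering is cancel first, to catch work still pending in the timer,
then flush, to wait out handlers already running, with no locks held that a
handler might also take.
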
