In case of i/f destroy all exches must be freed before its EM mempool destroyed but currently some exches could be still releasing in their scheduled work while EM mempool destroy called.
Fixing this issue by adding work queue thread fc_exch_workqueue for exch timeout work scheduling and then flush this work queue before EM mempool gets destroyed. The cancel_delayed_work_sync cannot be called during final fc_exch_reset due to lport locking orders, so removes related comment block not relevant any more. More details on this issue is discussed in this email thread http://www.open-fcoe.org/pipermail/devel/2009-August/003439.html RFC notes:- This issue is with rare race to reproduce, so I added test code to simulate this race. I added test code to start dummy large exch timer on a exch during its final release in fc_exch_release() and then made sure this exch is finally released only in timer execution. I ensured this by adding additional exch tracking state and setting ep->resp to null. I also skipped cancel_delayed_work in fc_exch_reset code path for this dummy exch, so that this will be released during added flush_workqueue() forcing fc_exch_timeout() handler call before its large timeout value. Now at destroy I do see dummy exch still pending before added flush_workqueue() but it remains pending even after finished calling flush_workqueue(). Though I expected flush_workqueue() to force exch timer work execution and then block until all pending work flushed. I see later dummy exch timer fires on its timeout value and then only dummy exch get freed, I also had test code waiting for all exch to be freed for large dummy exch timeout value. So it seems either flush_workqueue() is not working as expected in this case or I messed up in testing while simulating this race. Any thoughts, why flush_workqueue() was not blocked in this case? I don't see the issue earlier fix had with calling flush_scheduled_work() from another work context fcoe_destroy_work. So perhaps Joe you can test this fix since you were seeing this race to confirm my testing was messed up and fix works. Signed-off-by: Vasu Dev <[email protected]> --- drivers/scsi/libfc/fc_exch.c | 16 ++++++++-------- 1 files changed, 8 insertions(+), 8 deletions(-) diff --git a/drivers/scsi/libfc/fc_exch.c b/drivers/scsi/libfc/fc_exch.c index c1c1574..993e008 100644 --- a/drivers/scsi/libfc/fc_exch.c +++ b/drivers/scsi/libfc/fc_exch.c @@ -36,6 +36,7 @@ u16 fc_cpu_mask; /* cpu mask for possible cpus */ EXPORT_SYMBOL(fc_cpu_mask); static u16 fc_cpu_order; /* 2's power to represent total possible cpus */ static struct kmem_cache *fc_em_cachep; /* cache for exchanges */ +struct workqueue_struct *fc_exch_workqueue; /* * Structure and function definitions for managing Fibre Channel Exchanges @@ -354,8 +355,8 @@ static inline void fc_exch_timer_set_locked(struct fc_exch *ep, FC_EXCH_DBG(ep, "Exchange timer armed\n"); - if (schedule_delayed_work(&ep->timeout_work, - msecs_to_jiffies(timer_msec))) + if (queue_delayed_work(fc_exch_workqueue, &ep->timeout_work, + msecs_to_jiffies(timer_msec))) fc_exch_hold(ep); /* hold for timer */ } @@ -1446,12 +1447,6 @@ static void fc_exch_reset(struct fc_exch *ep) spin_lock_bh(&ep->ex_lock); ep->state |= FC_EX_RST_CLEANUP; - /* - * we really want to call del_timer_sync, but cannot due - * to the lport calling with the lport lock held (some resp - * functions can also grab the lport lock which could cause - * a deadlock). - */ if (cancel_delayed_work(&ep->timeout_work)) atomic_dec(&ep->ex_refcnt); /* drop hold for timer */ resp = ep->resp; @@ -1898,6 +1893,7 @@ void fc_exch_mgr_free(struct fc_lport *lport) { struct fc_exch_mgr_anchor *ema, *next; + flush_workqueue(fc_exch_workqueue); list_for_each_entry_safe(ema, next, &lport->ema_list, ema_list) fc_exch_mgr_del(ema); } @@ -2069,6 +2065,9 @@ int fc_exch_init(struct fc_lport *lp) } fc_cpu_mask--; + fc_exch_workqueue = create_singlethread_workqueue("fc_exch_workqueue"); + if (!fc_exch_workqueue) + return -ENOMEM; return 0; } EXPORT_SYMBOL(fc_exch_init); @@ -2084,5 +2083,6 @@ int fc_setup_exch_mgr(void) void fc_destroy_exch_mgr(void) { + destroy_workqueue(fc_exch_workqueue); kmem_cache_destroy(fc_em_cachep); } _______________________________________________ devel mailing list [email protected] http://www.open-fcoe.org/mailman/listinfo/devel
