Vasu Dev wrote:
> In case of an interface destroy, all exchanges must be freed before the
> EM mempool is destroyed, but currently some exchanges could still be
> releasing in their scheduled work while the EM mempool destroy is called.
>
> Fix this issue by calling flush_scheduled_work() to complete all
> pending exchange-related work before the EM mempool is destroyed during
> the interface destroy.
>
> cancel_delayed_work_sync() cannot be called during the final
> fc_exch_reset() to complete all exchange work due to lport locking
> order, so remove the related comment block, which is no longer relevant.
>
> More details on this issue are discussed in this email thread:
> http://www.open-fcoe.org/pipermail/devel/2009-August/003439.html
>
> RFC notes:-
>
> Now I'm running into another issue with the added flush_scheduled_work():
> it forces the whole system work queue to be flushed, and that includes
> the fc_host work for fc_rport_final_delete, and that thread hangs
> with three locks held: fc_host->work_q_name, rport->rport_delete_work,
> and shost->scan_mutex. I don't see any of these locks held when the
> added flush_scheduled_work() is called, so I suppose this issue must
> have been fixed by Joe's pending rport deletion related fixes.
> I also couldn't reproduce this issue here before this patch;
> it looks like a rare race.
>
> So, Joe, could you please test this fix in your setup with your
> rport deletion related fix applied?
>
> Signed-off-by: Vasu Dev <[email protected]>
> ---
>
> drivers/scsi/libfc/fc_exch.c | 7 +------
> 1 files changed, 1 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/scsi/libfc/fc_exch.c b/drivers/scsi/libfc/fc_exch.c
> index b51db15..9c754d5 100644
> --- a/drivers/scsi/libfc/fc_exch.c
> +++ b/drivers/scsi/libfc/fc_exch.c
> @@ -1446,12 +1446,6 @@ static void fc_exch_reset(struct fc_exch *ep)
>
> spin_lock_bh(&ep->ex_lock);
> ep->state |= FC_EX_RST_CLEANUP;
> - /*
> - * we really want to call del_timer_sync, but cannot due
> - * to the lport calling with the lport lock held (some resp
> - * functions can also grab the lport lock which could cause
> - * a deadlock).
> - */
> if (cancel_delayed_work(&ep->timeout_work))
> atomic_dec(&ep->ex_refcnt); /* drop hold for timer */
> resp = ep->resp;
> @@ -1898,6 +1892,7 @@ void fc_exch_mgr_free(struct fc_lport *lport)
> {
> struct fc_exch_mgr_anchor *ema, *next;
>
> + flush_scheduled_work();
> list_for_each_entry_safe(ema, next, &lport->ema_list, ema_list)
> fc_exch_mgr_del(ema);
> }
>
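For context, a minimal sketch of the teardown ordering the patch is after,
using a hypothetical example_mgr and example_mgr_free() in place of the real
fc_exch_mgr and fc_exch_mgr_free(): any work item that can still touch an
exchange has to finish before the pool backing those exchanges goes away.

#include <linux/workqueue.h>
#include <linux/mempool.h>

/* Hypothetical stand-in for struct fc_exch_mgr, for illustration only. */
struct example_mgr {
	mempool_t *ep_pool;	/* pool that exchange work items free into */
};

static void example_mgr_free(struct example_mgr *mgr)
{
	/*
	 * Wait for everything queued on the shared system workqueue,
	 * including any exchange timeout/release work still in flight,
	 * to complete first ...
	 */
	flush_scheduled_work();

	/* ... and only then destroy the pool that work was using. */
	mempool_destroy(mgr->ep_pool);
}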
Vasu,
I applied this patch on top of other recent patches. Now that the
fcoe_destroy_work() is done on a work thread, we can't do a
flush_scheduled_work() here. It flushes the work queue it's
running on, causing that worker thread to hang forever.
The result is:
[346173.781137] WARNING: at kernel/workqueue.c:368 flush_cpu_workqueue+0x31/0x87()
[346173.788660] Hardware name: X7DB8
[346173.792063] Modules linked in: fcoe(-) libfcoe libfc st scsi_transport_fc e1000e ixgbe mdio [last unloaded: libfc]
[346173.802866] Pid: 15, comm: events/0 Not tainted 2.6.31-rc4-rp9 #8
[346173.809163] Call Trace:
[346173.811740] [<ffffffff81052d73>] ? flush_cpu_workqueue+0x31/0x87
[346173.818067] [<ffffffff8104183e>] warn_slowpath_common+0x77/0xa4
[346173.824321] [<ffffffff8104187a>] warn_slowpath_null+0xf/0x11
[346173.830257] [<ffffffff81052d73>] flush_cpu_workqueue+0x31/0x87
[346173.836385] [<ffffffff81053491>] ? flush_workqueue+0x0/0x9d
[346173.842244] [<ffffffff81053509>] flush_workqueue+0x78/0x9d
[346173.848025] [<ffffffff81053491>] ? flush_workqueue+0x0/0x9d
[346173.853893] [<ffffffff8105353e>] flush_scheduled_work+0x10/0x12
[346173.860088] [<ffffffffa0230e64>] fc_exch_mgr_free+0xf/0x32 [libfc]
[346173.866586] [<ffffffffa024d8c3>] fcoe_if_destroy+0x1b7/0x1db [fcoe]
[346173.873170] [<ffffffffa024d90d>] fcoe_destroy_work+0x26/0x36 [fcoe]
[346173.879731] [<ffffffff810527ec>] worker_thread+0x1fa/0x30a
[346173.885509] [<ffffffff81052795>] ? worker_thread+0x1a3/0x30a
[346173.891469] [<ffffffff8150cb95>] ? thread_return+0x3e/0xc3
[346173.897247] [<ffffffffa024d8e7>] ? fcoe_destroy_work+0x0/0x36 [fcoe]
This makes the system unusable since one event thread is locked up.
I don't have a suggested fix at this point.
Joe
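For illustration, a minimal sketch (hypothetical names, not the actual fcoe
code) of the pattern that trips the warning above: a work item that runs on
the shared system workqueue and then tries to flush that same queue ends up
waiting on itself.

#include <linux/workqueue.h>

/* Hypothetical example; the real path is fcoe_destroy_work() ->
 * fcoe_if_destroy() -> fc_exch_mgr_free() -> flush_scheduled_work(). */
static void example_destroy_work(struct work_struct *work)
{
	/*
	 * This function is itself executing on the shared system workqueue
	 * (events/N).  Flushing that queue from here means waiting for the
	 * very work item we are running, so flush_cpu_workqueue() warns and
	 * the worker thread can hang forever.
	 */
	flush_scheduled_work();
}

static DECLARE_WORK(example_destroy, example_destroy_work);

/*
 * schedule_work(&example_destroy) queues this on the same system
 * workqueue that the flush above then tries to drain.
 */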