Vasu Dev wrote:
> In case of interface destroy, all exchanges must be freed before
> their EM mempool is destroyed, but currently some exchanges may
> still be released from their scheduled work while the EM mempool
> destroy is called.
> 
> Fix this issue by calling flush_scheduled_work to complete all
> pending exchange-related work before the EM mempool is destroyed
> during interface destroy.
> 
> cancel_delayed_work_sync cannot be called during the final
> fc_exch_reset to complete all exchange work, due to lport lock
> ordering, so the related comment block is removed since it is no
> longer relevant.
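
A minimal sketch of the lock-ordering problem described above, with
stand-in names (lport_lock, exch_timeout() and the reset helper are
hypothetical; only the mutex/workqueue primitives are real kernel API):
the caller of fc_exch_reset holds the lport lock, and a timed-out
exchange's resp handler may also take the lport lock, so waiting
synchronously for the timeout work from under that lock can deadlock.

/*
 * Illustrative only.  lport_lock, exch_timeout() and
 * reset_with_lport_lock_held() are stand-ins; only the mutex and
 * workqueue primitives are real kernel API.  Assume timeout_work was
 * initialized elsewhere with INIT_DELAYED_WORK(&timeout_work,
 * exch_timeout) and has already been scheduled.
 */
#include <linux/mutex.h>
#include <linux/workqueue.h>

static DEFINE_MUTEX(lport_lock);		/* stand-in for the lport lock */
static struct delayed_work timeout_work;	/* stand-in for ep->timeout_work */

static void exch_timeout(struct work_struct *work)
{
	mutex_lock(&lport_lock);	/* some resp handlers take the lport lock */
	/* ... complete the timed-out exchange ... */
	mutex_unlock(&lport_lock);
}

static void reset_with_lport_lock_held(void)
{
	mutex_lock(&lport_lock);
	/*
	 * If exch_timeout() is already running, the _sync variant waits
	 * for it to finish, but exch_timeout() is blocked on lport_lock,
	 * which we hold: deadlock.  Hence fc_exch_reset() has to use the
	 * plain, non-waiting cancel_delayed_work() instead.
	 */
	cancel_delayed_work_sync(&timeout_work);
	mutex_unlock(&lport_lock);
}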
> 
> More details on this issue are discussed in this email thread:
> http://www.open-fcoe.org/pipermail/devel/2009-August/003439.html
> 
> RFC notes:-
> 
>   Now I'm running into another issue with the added
> flush_scheduled_work: it forces the whole system work queue to be
> flushed, including the fc_host work for fc_rport_final_delete, and
> that thread hangs with three locks held: fc_host->work_q_name,
> rport->rport_delete_work and shost->scan_mutex. I don't see any of
> these locks held when the added flush_scheduled_work is called, and
> I suppose this issue must have been fixed by Joe's pending rport
> deletion related fixes. I also couldn't reproduce this issue here
> before this patch; it looks like a rare race.
> 
>    So Joe, could you please test this fix in your setup with your
> rport deletion related fix applied?
> 
> Signed-off-by: Vasu Dev <[email protected]>
> ---
> 
>  drivers/scsi/libfc/fc_exch.c |    7 +------
>  1 files changed, 1 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/scsi/libfc/fc_exch.c b/drivers/scsi/libfc/fc_exch.c
> index b51db15..9c754d5 100644
> --- a/drivers/scsi/libfc/fc_exch.c
> +++ b/drivers/scsi/libfc/fc_exch.c
> @@ -1446,12 +1446,6 @@ static void fc_exch_reset(struct fc_exch *ep)
>  
>       spin_lock_bh(&ep->ex_lock);
>       ep->state |= FC_EX_RST_CLEANUP;
> -     /*
> -      * we really want to call del_timer_sync, but cannot due
> -      * to the lport calling with the lport lock held (some resp
> -      * functions can also grab the lport lock which could cause
> -      * a deadlock).
> -      */
>       if (cancel_delayed_work(&ep->timeout_work))
>               atomic_dec(&ep->ex_refcnt);     /* drop hold for timer */
>       resp = ep->resp;
> @@ -1898,6 +1892,7 @@ void fc_exch_mgr_free(struct fc_lport *lport)
>  {
>       struct fc_exch_mgr_anchor *ema, *next;
>  
> +     flush_scheduled_work();
>       list_for_each_entry_safe(ema, next, &lport->ema_list, ema_list)
>               fc_exch_mgr_del(ema);
>  }
> 

Vasu,

I applied this patch on top of other recent patches.  Now that
fcoe_destroy_work() is done on a worker thread, we can't do a
flush_scheduled_work() here: it flushes the work queue it's
running on, causing that worker thread to hang forever.
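
For reference, a minimal sketch of that self-flush: a hypothetical
module (not from the patch; only the workqueue API calls are real)
whose work item runs on the shared system workqueue and then flushes
that same queue.  The flush has to wait for every queued item,
including the one doing the flushing, which is the situation the
WARNING in flush_cpu_workqueue() below is flagging.

/*
 * Hypothetical module, for illustration only.  A work item running on
 * the system workqueue (events/N) flushes that same queue; the flush
 * cannot complete while the flushing item itself is still running.
 */
#include <linux/module.h>
#include <linux/workqueue.h>

static void self_flush_fn(struct work_struct *work)
{
	/* Runs on events/N and then waits for events/N to drain: hangs. */
	flush_scheduled_work();
}

static DECLARE_WORK(self_flush_work, self_flush_fn);

static int __init self_flush_init(void)
{
	schedule_work(&self_flush_work);
	return 0;
}
module_init(self_flush_init);

static void __exit self_flush_exit(void)
{
}
module_exit(self_flush_exit);

MODULE_LICENSE("GPL");

The fcoe_destroy_work() -> fcoe_if_destroy() -> fc_exch_mgr_free() ->
flush_scheduled_work() chain in the trace below is exactly this pattern.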

The result is:

[346173.781137] WARNING: at kernel/workqueue.c:368 flush_cpu_workqueue+0x31/0x87()
[346173.788660] Hardware name: X7DB8
[346173.792063] Modules linked in: fcoe(-) libfcoe libfc st scsi_transport_fc e1000e ixgbe mdio [last unloaded: libfc]
[346173.802866] Pid: 15, comm: events/0 Not tainted 2.6.31-rc4-rp9 #8
[346173.809163] Call Trace:
[346173.811740]  [<ffffffff81052d73>] ? flush_cpu_workqueue+0x31/0x87
[346173.818067]  [<ffffffff8104183e>] warn_slowpath_common+0x77/0xa4
[346173.824321]  [<ffffffff8104187a>] warn_slowpath_null+0xf/0x11
[346173.830257]  [<ffffffff81052d73>] flush_cpu_workqueue+0x31/0x87
[346173.836385]  [<ffffffff81053491>] ? flush_workqueue+0x0/0x9d
[346173.842244]  [<ffffffff81053509>] flush_workqueue+0x78/0x9d
[346173.848025]  [<ffffffff81053491>] ? flush_workqueue+0x0/0x9d
[346173.853893]  [<ffffffff8105353e>] flush_scheduled_work+0x10/0x12
[346173.860088]  [<ffffffffa0230e64>] fc_exch_mgr_free+0xf/0x32 [libfc]
[346173.866586]  [<ffffffffa024d8c3>] fcoe_if_destroy+0x1b7/0x1db [fcoe]
[346173.873170]  [<ffffffffa024d90d>] fcoe_destroy_work+0x26/0x36 [fcoe]
[346173.879731]  [<ffffffff810527ec>] worker_thread+0x1fa/0x30a
[346173.885509]  [<ffffffff81052795>] ? worker_thread+0x1a3/0x30a
[346173.891469]  [<ffffffff8150cb95>] ? thread_return+0x3e/0xc3
[346173.897247]  [<ffffffffa024d8e7>] ? fcoe_destroy_work+0x0/0x36 [fcoe]

This makes the system unusable since one event thread is locked up.
I don't have a suggested fix at this point.

        Joe



_______________________________________________
devel mailing list
[email protected]
http://www.open-fcoe.org/mailman/listinfo/devel
