On Fri, 2009-08-07 at 13:07 -0700, Joe Eykholt wrote:
> Vasu Dev wrote:
> > In case of i/f destroy, all exches must be freed before their EM
> > mempool is destroyed, but currently some exches could still be
> > releasing in their scheduled work when the EM mempool destroy is
> > called.
> >
> > Fix this issue by calling flush_scheduled_work to complete all
> > pending exch-related work before the EM mempool is destroyed
> > during i/f destroy.
> >
> > cancel_delayed_work_sync cannot be called during the final
> > fc_exch_reset to complete all exch work due to lport lock
> > ordering, so the related comment block, which is no longer
> > relevant, is removed.
> >
> > More details on this issue are discussed in this email thread:
> > http://www.open-fcoe.org/pipermail/devel/2009-August/003439.html
> >
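To spell out the lock-ordering problem mentioned above:
cancel_delayed_work_sync() waits for a running handler to return, so
it must not be called while holding a lock that the handler may also
take. A minimal, hypothetical sketch (made-up names, not the actual
libfc code):

#include <linux/jiffies.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

static DEFINE_SPINLOCK(demo_lock);     /* stands in for the lport lock  */
static struct delayed_work demo_work;  /* stands in for a timeout_work  */

static void demo_timeout(struct work_struct *work)
{
	spin_lock_bh(&demo_lock);      /* handler takes the same lock  */
	/* ... timeout/response handling ... */
	spin_unlock_bh(&demo_lock);
}

static void demo_init(void)
{
	INIT_DELAYED_WORK(&demo_work, demo_timeout);
	schedule_delayed_work(&demo_work, 2 * HZ);
}

static void demo_reset(void)
{
	spin_lock_bh(&demo_lock);
	/*
	 * DEADLOCK RISK: cancel_delayed_work_sync() waits for
	 * demo_timeout() to return, but demo_timeout() is waiting
	 * for demo_lock, which this context already holds.
	 */
	cancel_delayed_work_sync(&demo_work);
	spin_unlock_bh(&demo_lock);
}

That is why fc_exch_reset() sticks with the non-sync
cancel_delayed_work() and the remaining exch work is instead flushed
later, before the EM mempool is freed.
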
> > RFC notes:-
> >
> > Now I'm running into another issue with the added
> > flush_scheduled_work: it forces the entire system work queue to be
> > flushed, which includes the fc_host work for fc_rport_final_delete,
> > and that thread hangs with three locks held (fc_host->work_q_name,
> > rport->rport_delete_work and shost->scan_mutex). I don't see any of
> > these locks held when the added flush_scheduled_work is called, so
> > I suppose this issue must have been fixed by Joe's pending
> > rport-deletion-related fixes. I also couldn't reproduce this issue
> > here before this patch; it looks like a rare race.
>
> I just posted a hang something like this on linux-scsi,
> and there's another hang I reported in July that's similar.
> These are the two postings:
>
> http://marc.info/?l=linux-scsi&m=124966844805471&w=2
> http://marc.info/?l=linux-scsi&m=124856078500527&w=2
>
> I've been talking to Mike Christie about possible ways to fix the latter.
>
Good that you guys are on top of this issue. My frequent hang with
this patch looks the same as the one in your second post, as far as
I can tell from the logs; here are mine.
[ 285.989083] rport-6:0-2: blocked FC remote port time out: leaving
rport and starget alone
[ 285.989096] rport-6:0-3: blocked FC remote port time out: leaving
rport and starget alone
[ 285.989104] rport-6:0-4: blocked FC remote port time out: leaving
rport and starget alone
[ 359.945988] INFO: task fc_wq_6:7437 blocked for more than 120
seconds.
[ 359.945996] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 359.946001] fc_wq_6 D ffff88033dc13b40 5704 7437 2
0x00000080
[ 359.946027] ffff88033dc13b30 0000000000000046 ffff88033dc13ac0
ffff88013dd3cc00
[ 359.946043] ffff88033ad74340 0000000000011c00 0000000000011c00
ffff88033ad746d8
[ 359.946057] 0000000000011c00 0000000000011c00 0000000000011c00
ffff88033ad746d8
[ 359.946071] Call Trace:
[ 359.946085] [<ffffffff8105cb67>] async_synchronize_cookie_domain
+0xb1/0x10d
[ 359.946093] [<ffffffff810570d3>] ? autoremove_wake_function+0x0/0x34
[ 359.946102] [<ffffffff81127b52>] ? sysfs_hash_and_remove+0x4e/0x65
[ 359.946109] [<ffffffff8105cbd3>] async_synchronize_cookie+0x10/0x12
[ 359.946116] [<ffffffff8105cbe5>] async_synchronize_full+0x10/0x2c
[ 359.946127] [<ffffffffa0001711>] sd_remove+0x15/0x8b [sd_mod]
[ 359.946137] [<ffffffff81295d1e>] __device_release_driver+0x67/0xb0
[ 359.946146] [<ffffffff81295e35>] device_release_driver+0x1e/0x2b
[ 359.946153] [<ffffffff812951cb>] bus_remove_device+0xa8/0xbd
[ 359.946159] [<ffffffff81293ae0>] device_del+0x132/0x181
[ 359.946167] [<ffffffff812a9758>] __scsi_remove_device+0x44/0x81
[ 359.946173] [<ffffffff812a97bb>] scsi_remove_device+0x26/0x33
[ 359.946179] [<ffffffff812a9871>] __scsi_remove_target+0x97/0xd5
[ 359.946184] [<ffffffff812a98f6>] ? __remove_child+0x0/0x25
[ 359.946190] [<ffffffff812a9914>] __remove_child+0x1e/0x25
[ 359.946196] [<ffffffff812a98f6>] ? __remove_child+0x0/0x25
[ 359.946202] [<ffffffff8129343f>] device_for_each_child+0x35/0x6c
[ 359.946208] [<ffffffff812a98e9>] scsi_remove_target+0x3a/0x47
[ 359.946219] [<ffffffffa0061de8>] fc_starget_delete+0x21/0x25
[scsi_transport_fc]
[ 359.946230] [<ffffffffa0061ecb>] fc_rport_final_delete+0xdf/0x146
[scsi_transport_fc]
[ 359.946240] [<ffffffffa0061dec>] ? fc_rport_final_delete+0x0/0x146
[scsi_transport_fc]
[ 359.946248] [<ffffffff81052918>] run_workqueue+0x112/0x20c
[ 359.946256] [<ffffffff810528c7>] ? run_workqueue+0xc1/0x20c
[ 359.946265] [<ffffffff81052afc>] worker_thread+0xea/0xfb
[ 359.946275] [<ffffffff810570d3>] ? autoremove_wake_function+0x0/0x34
[ 359.946286] [<ffffffff81052a12>] ? worker_thread+0x0/0xfb
[ 359.946297] [<ffffffff81056d92>] kthread+0x85/0x8d
[ 359.946308] [<ffffffff8100cbfa>] child_rip+0xa/0x20
[ 359.946321] [<ffffffff8100c5bc>] ? restore_args+0x0/0x30
[ 359.946327] [<ffffffff81056cec>] ? kthreadd+0x107/0x128
[ 359.946334] [<ffffffff81056d0d>] ? kthread+0x0/0x8d
[ 359.946340] [<ffffffff8100cbf0>] ? child_rip+0x0/0x20
[ 359.946345] 3 locks held by fc_wq_6/7437:
[ 359.946349] #0: ((fc_host->work_q_name)){+.+.+.}, at:
[<ffffffff810528c7>] run_workqueue+0xc1/0x20c
[ 359.946366] #1: (&rport->rport_delete_work){+.+.+.}, at:
[<ffffffff810528c7>] run_workqueue+0xc1/0x20c
[ 359.946381] #2: (&shost->scan_mutex){+.+.+.}, at:
[<ffffffff812a97b3>] scsi_remove_device+0x1e/0x33
[ 359.946398] INFO: task async/4:7484 blocked for more than 120
seconds.
[ 359.946402] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 359.946407] async/4 D 0000000000000002 5080 7484 2
0x00000080
[ 359.946418] ffff880339c019a0 0000000000000046 ffff88033dcbaa80
ffff88033adb6b40
[ 359.946432] 000000000000005b 0000000000011c00 0000000000011c00
ffff88033dcbae18
[ 359.946447] 0000000000011c00 0000000000011c00 0000000000011c00
ffff88033dcbae18
[ 359.946460] Call Trace:
[ 359.946470] [<ffffffff814f6e4d>] schedule_timeout+0x22/0xc5
[ 359.946477] [<ffffffff814f844b>] ? _spin_unlock_irq+0x2b/0x31
[ 359.946485] [<ffffffff81066484>] ? trace_hardirqs_on_caller
+0x10c/0x130
[ 359.946494] [<ffffffff810664b5>] ? trace_hardirqs_on+0xd/0xf
[ 359.946503] [<ffffffff814f844b>] ? _spin_unlock_irq+0x2b/0x31
[ 359.946512] [<ffffffff814f6d13>] wait_for_common+0xc1/0x120
[ 359.946524] [<ffffffff8103b43a>] ? default_wake_function+0x0/0xf
[ 359.946537] [<ffffffff814f6dfc>] wait_for_completion+0x18/0x1a
[ 359.946550] [<ffffffff811c71a8>] blk_execute_rq+0xaf/0xcf
[ 359.946562] [<ffffffff810d11bc>] ? __kmalloc+0x1b1/0x1c3
[ 359.946576] [<ffffffff810f9107>] ? bio_phys_segments+0x17/0x1f
[ 359.946587] [<ffffffff812a618e>] scsi_execute+0xf1/0x143
[ 359.946593] [<ffffffff812a6260>] scsi_execute_req+0x80/0xb2
[ 359.946602] [<ffffffffa000180a>] read_capacity_10+0x83/0x1a9
[sd_mod]
[ 359.946611] [<ffffffffa000286e>] sd_read_capacity+0x58b/0x5b1
[sd_mod]
[ 359.946619] [<ffffffff81060000>] ? clocksource_change_rating
+0x3c/0x99
[ 359.946627] [<ffffffffa0002dcc>] sd_revalidate_disk+0x111/0x1d5
[sd_mod]
[ 359.946635] [<ffffffffa0002f5e>] sd_probe_async+0xce/0x1d4 [sd_mod]
[ 359.946642] [<ffffffff810664b5>] ? trace_hardirqs_on+0xd/0xf
[ 359.946647] [<ffffffff8105c996>] ? async_thread+0x0/0xb6
[ 359.946653] [<ffffffff8105c8ea>] run_one_entry+0xa4/0x150
[ 359.946659] [<ffffffff8105c9fc>] async_thread+0x66/0xb6
[ 359.946666] [<ffffffff8103b43a>] ? default_wake_function+0x0/0xf
[ 359.946672] [<ffffffff8105c996>] ? async_thread+0x0/0xb6
Earlier I got the same lockdep warning as in your second post:
[72135.775948] =======================================================
[72135.776363] [ INFO: possible circular locking dependency detected ]
[72135.776574] 2.6.31-rc4-upstream #49
[72135.776780] -------------------------------------------------------
[72135.776990] dd/14168 is trying to acquire lock:
[72135.777197] (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff81042d96>]
cpu_maps_update_begin+0x12/0x14
[72135.777621]
[72135.777621] but task is already holding lock:
[72135.778025] (sd_ref_mutex){+.+.+.}, at: [<ffffffffa0001058>]
scsi_disk_put+0x1c/0x3e [sd_mod]
[72135.778446]
[72135.778447] which lock already depends on the new lock.
[72135.778447]
[72135.779056]
[72135.779056] the existing dependency chain (in reverse order) is:
[72135.779469]
[72135.779469] -> #4 (sd_ref_mutex){+.+.+.}:
[72135.779883] [<ffffffff81067002>] check_prev_add+0x146/0x1c7
[72135.780100] [<ffffffff810674c3>] validate_chain+0x440/0x4da
[72135.780314] [<ffffffff8106784f>] __lock_acquire+0x2f2/0x363
[72135.780529] [<ffffffff8106798a>] lock_acquire+0xca/0xee
[72135.780739] [<ffffffff814f73c7>] __mutex_lock_common
+0x47/0x331
[72135.780953] [<ffffffff814f7760>] mutex_lock_nested+0x37/0x3c
[72135.781166] [<ffffffffa0001009>] scsi_disk_get_from_dev
+0x1a/0x4d [sd_mod]
[72135.781578] [<ffffffffa000150c>] sd_shutdown+0x12/0x116
[sd_mod]
[72135.781793] [<ffffffffa000174f>] sd_remove+0x53/0x8b [sd_mod]
[72135.782007] [<ffffffff81295d1e>] __device_release_driver
+0x67/0xb0
[72135.782226] [<ffffffff81295e35>] device_release_driver
+0x1e/0x2b
[72135.782441] [<ffffffff812951cb>] bus_remove_device+0xa8/0xbd
[72135.782655] [<ffffffff81293ae0>] device_del+0x132/0x181
[72135.782868] [<ffffffff812a9758>] __scsi_remove_device
+0x44/0x81
[72135.783084] [<ffffffff812a97bb>] scsi_remove_device+0x26/0x33
[72135.783299] [<ffffffff812a9871>] __scsi_remove_target
+0x97/0xd5
[72135.783513] [<ffffffff812a9914>] __remove_child+0x1e/0x25
[72135.783724] [<ffffffff8129343f>] device_for_each_child
+0x35/0x6c
[72135.783936] [<ffffffff812a98e9>] scsi_remove_target+0x3a/0x47
[72135.784149] [<ffffffffa0061de8>] fc_starget_delete+0x21/0x25
[scsi_transport_fc]
[72135.784564] [<ffffffffa0061ecb>] fc_rport_final_delete
+0xdf/0x146 [scsi_transport_fc]
[72135.784980] [<ffffffff81052918>] run_workqueue+0x112/0x20c
[72135.785192] [<ffffffff81052afc>] worker_thread+0xea/0xfb
[72135.785402] [<ffffffff81056d92>] kthread+0x85/0x8d
[72135.785614] [<ffffffff8100cbfa>] child_rip+0xa/0x20
[72135.785825] [<ffffffffffffffff>] 0xffffffffffffffff
[72135.786036]
[72135.786036] -> #3 (&shost->scan_mutex){+.+.+.}:
[72135.786447] [<ffffffff81067002>] check_prev_add+0x146/0x1c7
[72135.786658] [<ffffffff810674c3>] validate_chain+0x440/0x4da
[72135.786870] [<ffffffff8106784f>] __lock_acquire+0x2f2/0x363
[72135.787081] [<ffffffff8106798a>] lock_acquire+0xca/0xee
[72135.787292] [<ffffffff814f73c7>] __mutex_lock_common
+0x47/0x331
[72135.787503] [<ffffffff814f7760>] mutex_lock_nested+0x37/0x3c
[72135.787715] [<ffffffff812a97b3>] scsi_remove_device+0x1e/0x33
[72135.787927] [<ffffffff812a9871>] __scsi_remove_target
+0x97/0xd5
[72135.788141] [<ffffffff812a9914>] __remove_child+0x1e/0x25
[72135.788353] [<ffffffff8129343f>] device_for_each_child
+0x35/0x6c
[72135.788567] [<ffffffff812a98e9>] scsi_remove_target+0x3a/0x47
[72135.788780] [<ffffffffa0061de8>] fc_starget_delete+0x21/0x25
[scsi_transport_fc]
[72135.789192] [<ffffffffa0061ecb>] fc_rport_final_delete
+0xdf/0x146 [scsi_transport_fc]
[72135.789608] [<ffffffff81052918>] run_workqueue+0x112/0x20c
[72135.789822] [<ffffffff81052afc>] worker_thread+0xea/0xfb
[72135.790034] [<ffffffff81056d92>] kthread+0x85/0x8d
[72135.790245] [<ffffffff8100cbfa>] child_rip+0xa/0x20
[72135.790456] [<ffffffffffffffff>] 0xffffffffffffffff
[72135.790666]
[72135.790666] -> #2 (&rport->rport_delete_work){+.+.+.}:
[72135.791081] [<ffffffff81067002>] check_prev_add+0x146/0x1c7
[72135.791293] [<ffffffff810674c3>] validate_chain+0x440/0x4da
[72135.791506] [<ffffffff8106784f>] __lock_acquire+0x2f2/0x363
[72135.791721] [<ffffffff8106798a>] lock_acquire+0xca/0xee
[72135.791932] [<ffffffff81052912>] run_workqueue+0x10c/0x20c
[72135.792145] [<ffffffff81052afc>] worker_thread+0xea/0xfb
[72135.792355] [<ffffffff81056d92>] kthread+0x85/0x8d
[72135.792565] [<ffffffff8100cbfa>] child_rip+0xa/0x20
[72135.792775] [<ffffffffffffffff>] 0xffffffffffffffff
[72135.792985]
[72135.792986] -> #1 ((fc_host->work_q_name)){+.+.+.}:
[72135.793402] [<ffffffff81067002>] check_prev_add+0x146/0x1c7
[72135.793614] [<ffffffff810674c3>] validate_chain+0x440/0x4da
[72135.793827] [<ffffffff8106784f>] __lock_acquire+0x2f2/0x363
[72135.794039] [<ffffffff8106798a>] lock_acquire+0xca/0xee
[72135.794251] [<ffffffff8105302b>] cleanup_workqueue_thread
+0x45/0xb9
[72135.794464] [<ffffffff81053165>] destroy_workqueue+0x78/0xab
[72135.794677] [<ffffffffa00625fd>] fc_remove_host+0x179/0x1a1
[scsi_transport_fc]
[72135.795089] [<ffffffffa0090f61>] 0xffffffffa0090f61
[72135.795299] [<ffffffffa0090fe4>] 0xffffffffa0090fe4
[72135.795509] [<ffffffff81055171>] param_attr_store+0x22/0x30
[72135.795721] [<ffffffff810551c4>] module_attr_store+0x21/0x25
[72135.795932] [<ffffffff811284a5>] sysfs_write_file+0xf6/0x132
[72135.796145] [<ffffffff810d6a57>] vfs_write+0xa9/0x106
[72135.796359] [<ffffffff810d6b6a>] sys_write+0x45/0x69
[72135.796569] [<ffffffff8100baeb>] system_call_fastpath
+0x16/0x1b
[72135.796783] [<ffffffffffffffff>] 0xffffffffffffffff
[72135.796998]
[72135.796998] -> #0 (cpu_add_remove_lock){+.+.+.}:
[72135.797413] [<ffffffff81066f02>] check_prev_add+0x46/0x1c7
[72135.797626] [<ffffffff810674c3>] validate_chain+0x440/0x4da
[72135.797837] [<ffffffff8106784f>] __lock_acquire+0x2f2/0x363
[72135.798051] [<ffffffff8106798a>] lock_acquire+0xca/0xee
[72135.798264] [<ffffffff814f73c7>] __mutex_lock_common
+0x47/0x331
[72135.798477] [<ffffffff814f7760>] mutex_lock_nested+0x37/0x3c
[72135.798689] [<ffffffff81042d96>] cpu_maps_update_begin
+0x12/0x14
[72135.798902] [<ffffffff81053119>] destroy_workqueue+0x2c/0xab
[72135.799114] [<ffffffff812a09df>] scsi_host_dev_release
+0x56/0xb9
[72135.799329] [<ffffffff81293360>] device_release+0x49/0x76
[72135.799541] [<ffffffff811d3d83>] kobject_release+0x51/0x66
[72135.799758] [<ffffffff811d4b05>] kref_put+0x43/0x4d
[72135.799969] [<ffffffff811d3c8c>] kobject_put+0x47/0x4b
[72135.800181] [<ffffffff81292c92>] put_device+0x12/0x14
[72135.800392] [<ffffffffa006137f>] fc_rport_dev_release
+0x14/0x21 [scsi_transport_fc]
[72135.800802] [<ffffffff81293360>] device_release+0x49/0x76
[72135.801012] [<ffffffff811d3d83>] kobject_release+0x51/0x66
[72135.801223] [<ffffffff811d4b05>] kref_put+0x43/0x4d
[72135.801432] [<ffffffff811d3c8c>] kobject_put+0x47/0x4b
[72135.801642] [<ffffffff81292c92>] put_device+0x12/0x14
[72135.801852] [<ffffffff812a6922>] scsi_target_dev_release
+0x1d/0x21
[72135.802067] [<ffffffff81293360>] device_release+0x49/0x76
[72135.802280] [<ffffffff811d3d83>] kobject_release+0x51/0x66
[72135.802491] [<ffffffff811d4b05>] kref_put+0x43/0x4d
[72135.802703] [<ffffffff811d3c8c>] kobject_put+0x47/0x4b
[72135.802913] [<ffffffff81292c92>] put_device+0x12/0x14
[72135.803122] [<ffffffff812a93fb>]
scsi_device_dev_release_usercontext+0x189/0x195
[72135.803532] [<ffffffff81053aad>] execute_in_process_context
+0x2a/0x70
[72135.803746] [<ffffffff812a9270>] scsi_device_dev_release
+0x17/0x19
[72135.803958] [<ffffffff81293360>] device_release+0x49/0x76
[72135.809046] [<ffffffff811d3d83>] kobject_release+0x51/0x66
[72135.809259] [<ffffffff811d4b05>] kref_put+0x43/0x4d
[72135.809470] [<ffffffff811d3c8c>] kobject_put+0x47/0x4b
[72135.809682] [<ffffffff81292c92>] put_device+0x12/0x14
[72135.809893] [<ffffffff8129faad>] scsi_device_put+0x3d/0x42
[72135.810105] [<ffffffffa0001069>] scsi_disk_put+0x2d/0x3e
[sd_mod]
[72135.810319] [<ffffffffa0001b13>] sd_release+0x4d/0x54 [sd_mod]
[72135.810532] [<ffffffff810fadb3>] __blkdev_put+0x80/0x138
[72135.810746] [<ffffffff810fae76>] blkdev_put+0xb/0xd
[72135.810957] [<ffffffff810faeb1>] blkdev_close+0x39/0x3e
[72135.811170] [<ffffffff810d746f>] __fput+0x123/0x1d0
[72135.811381] [<ffffffff810d7531>] fput+0x15/0x17
[72135.811590] [<ffffffff810d46b5>] filp_close+0x63/0x6d
[72135.811801] [<ffffffff810d4767>] sys_close+0xa8/0xe2
[72135.812011] [<ffffffff8100baeb>] system_call_fastpath
+0x16/0x1b
[72135.812225] [<ffffffffffffffff>] 0xffffffffffffffff
[72135.812435]
[72135.812435] other info that might help us debug this:
[72135.812436]
[72135.813044] 2 locks held by dd/14168:
[72135.813249] #0: (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff810fad66>]
__blkdev_put+0x33/0x138
[72135.813668] #1: (sd_ref_mutex){+.+.+.}, at: [<ffffffffa0001058>]
scsi_disk_put+0x1c/0x3e [sd_mod]
[72135.814088]
> > So Joe, could you please test this fix in your setup with your
> > rport deletion related fix applied?
>
> Yes. I'll do my tests with this.
>
Thanks.
>
> >
> > Signed-off-by: Vasu Dev <[email protected]>
> > ---
> >
> > drivers/scsi/libfc/fc_exch.c | 7 +------
> > 1 files changed, 1 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/scsi/libfc/fc_exch.c b/drivers/scsi/libfc/fc_exch.c
> > index b51db15..9c754d5 100644
> > --- a/drivers/scsi/libfc/fc_exch.c
> > +++ b/drivers/scsi/libfc/fc_exch.c
> > @@ -1446,12 +1446,6 @@ static void fc_exch_reset(struct fc_exch *ep)
> >
> > spin_lock_bh(&ep->ex_lock);
> > ep->state |= FC_EX_RST_CLEANUP;
> > - /*
> > - * we really want to call del_timer_sync, but cannot due
> > - * to the lport calling with the lport lock held (some resp
> > - * functions can also grab the lport lock which could cause
> > - * a deadlock).
> > - */
> > if (cancel_delayed_work(&ep->timeout_work))
> > atomic_dec(&ep->ex_refcnt); /* drop hold for timer */
> > resp = ep->resp;
> > @@ -1898,6 +1892,7 @@ void fc_exch_mgr_free(struct fc_lport *lport)
> > {
> > struct fc_exch_mgr_anchor *ema, *next;
> >
> > + flush_scheduled_work();
> > list_for_each_entry_safe(ema, next, &lport->ema_list, ema_list)
> > fc_exch_mgr_del(ema);
> > }
> >
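To spell out the ordering the added flush_scheduled_work() call is
meant to guarantee, here is a minimal, hypothetical teardown sketch
(made-up names, not the actual fc_exch_mgr_free()/fc_exch_mgr_del()
code):

#include <linux/mempool.h>
#include <linux/workqueue.h>

/* Hypothetical teardown helper; names are illustrative only. */
static void demo_em_destroy(mempool_t *ep_pool)
{
	/*
	 * Wait for everything queued on the shared system workqueue,
	 * including any pending exchange release work, to finish.
	 */
	flush_scheduled_work();

	/*
	 * Only now is it safe to destroy the pool: no work item can
	 * still be holding a pool-backed exchange.
	 */
	mempool_destroy(ep_pool);
}

The trade-off, as described in the RFC notes above, is that flushing
the shared system workqueue also waits for unrelated work items such
as fc_rport_final_delete.
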
_______________________________________________
devel mailing list
[email protected]
http://www.open-fcoe.org/mailman/listinfo/devel