On Fri, 2009-08-07 at 13:07 -0700, Joe Eykholt wrote:
> Vasu Dev wrote:
> > When an interface is destroyed, all exchanges must be freed before
> > the EM mempool is destroyed, but currently some exchanges may still
> > be releasing in their scheduled work when the EM mempool destroy is
> > called.
> > 
> > Fix this by calling flush_scheduled_work() to complete all pending
> > exchange-related work before the EM mempool is destroyed during
> > interface destroy.
> > 
> > cancel_delayed_work_sync() cannot be called during the final
> > fc_exch_reset() to complete all exchange work, due to lport lock
> > ordering, so remove the related comment block since it is no
> > longer relevant.
> > 
> > More details on this issue are discussed in this email thread:
> > http://www.open-fcoe.org/pipermail/devel/2009-August/003439.html
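
(To make the race concrete: here is a minimal sketch of the pattern
being fixed. It is simplified code in the shape of libfc, not the
actual source; em_destroy() and exch_timeout_work() are illustrative
stand-ins.)

/*
 * Exchange timeout work runs on the shared system workqueue and
 * releases the exchange back into the EM mempool, so the pool must
 * outlive every scheduled work item.
 */
static void exch_timeout_work(struct work_struct *work)
{
	struct fc_exch *ep = container_of(work, struct fc_exch,
					  timeout_work.work);

	/* ... normal timeout handling ... */
	fc_exch_release(ep);	/* may free ep back into em->ep_pool */
}

static void em_destroy(struct fc_exch_mgr *em)
{
	/*
	 * If a scheduled exch_timeout_work() is still running here, its
	 * release touches freed pool memory: a use-after-free. Hence
	 * the flush_scheduled_work() added in fc_exch_mgr_free() below.
	 */
	mempool_destroy(em->ep_pool);
}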
> > 
> > RFC notes:-
> > 
> >   Now I'm running into another issue with the added
> > flush_scheduled_work(): it forces the entire system work queue to
> > be flushed, and that includes the fc_host work for
> > fc_rport_final_delete, whose thread hangs with three locks held:
> > fc_host->work_q_name, rport->rport_delete_work and
> > shost->scan_mutex. I don't see any of these locks held when the
> > added flush_scheduled_work() is called, and I suspect this issue
> > will be fixed by Joe's pending rport deletion fixes. I also
> > couldn't reproduce this issue here before this patch; it looks
> > like a rare race.
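
To spell out the hazard in my note above: flush_scheduled_work() waits
for every item on the shared system workqueue, not just libfc's
exchange work, so it is only safe from plain process context with none
of the flushed work's locks held. An annotated restatement of the
patched function (might_sleep() added here only for illustration):

void fc_exch_mgr_free(struct fc_lport *lport)
{
	struct fc_exch_mgr_anchor *ema, *next;

	might_sleep();	/* the flush sleeps; process context only */
	/*
	 * Deadlocks if called from a work item on the same queue, or
	 * while holding any lock that a queued work item may take --
	 * e.g. the shost->scan_mutex chain in the hang logs below.
	 */
	flush_scheduled_work();
	list_for_each_entry_safe(ema, next, &lport->ema_list, ema_list)
		fc_exch_mgr_del(ema);
}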
> 
> I just posted a hang report much like this on linux-scsi,
> and there's another similar hang I reported in July.
> These are the two postings:
> 
> http://marc.info/?l=linux-scsi&m=124966844805471&w=2
> http://marc.info/?l=linux-scsi&m=124856078500527&w=2
> 
> I've been talking to Mike Christie about possible ways to fix the latter.
> 

Good that you guys are on top of this issue. The frequent hang I hit
with this patch looks the same as the one in your second post, as far
as I can tell; here are my logs.


[  285.989083]  rport-6:0-2: blocked FC remote port time out: leaving
rport and starget alone
[  285.989096]  rport-6:0-3: blocked FC remote port time out: leaving
rport and starget alone
[  285.989104]  rport-6:0-4: blocked FC remote port time out: leaving
rport and starget alone
[  359.945988] INFO: task fc_wq_6:7437 blocked for more than 120
seconds.
[  359.945996] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  359.946001] fc_wq_6       D ffff88033dc13b40  5704  7437      2
0x00000080
[  359.946027]  ffff88033dc13b30 0000000000000046 ffff88033dc13ac0
ffff88013dd3cc00
[  359.946043]  ffff88033ad74340 0000000000011c00 0000000000011c00
ffff88033ad746d8
[  359.946057]  0000000000011c00 0000000000011c00 0000000000011c00
ffff88033ad746d8
[  359.946071] Call Trace:
[  359.946085]  [<ffffffff8105cb67>] async_synchronize_cookie_domain
+0xb1/0x10d
[  359.946093]  [<ffffffff810570d3>] ? autoremove_wake_function+0x0/0x34
[  359.946102]  [<ffffffff81127b52>] ? sysfs_hash_and_remove+0x4e/0x65
[  359.946109]  [<ffffffff8105cbd3>] async_synchronize_cookie+0x10/0x12
[  359.946116]  [<ffffffff8105cbe5>] async_synchronize_full+0x10/0x2c
[  359.946127]  [<ffffffffa0001711>] sd_remove+0x15/0x8b [sd_mod]
[  359.946137]  [<ffffffff81295d1e>] __device_release_driver+0x67/0xb0
[  359.946146]  [<ffffffff81295e35>] device_release_driver+0x1e/0x2b
[  359.946153]  [<ffffffff812951cb>] bus_remove_device+0xa8/0xbd
[  359.946159]  [<ffffffff81293ae0>] device_del+0x132/0x181
[  359.946167]  [<ffffffff812a9758>] __scsi_remove_device+0x44/0x81
[  359.946173]  [<ffffffff812a97bb>] scsi_remove_device+0x26/0x33
[  359.946179]  [<ffffffff812a9871>] __scsi_remove_target+0x97/0xd5
[  359.946184]  [<ffffffff812a98f6>] ? __remove_child+0x0/0x25
[  359.946190]  [<ffffffff812a9914>] __remove_child+0x1e/0x25
[  359.946196]  [<ffffffff812a98f6>] ? __remove_child+0x0/0x25
[  359.946202]  [<ffffffff8129343f>] device_for_each_child+0x35/0x6c
[  359.946208]  [<ffffffff812a98e9>] scsi_remove_target+0x3a/0x47
[  359.946219]  [<ffffffffa0061de8>] fc_starget_delete+0x21/0x25
[scsi_transport_fc]
[  359.946230]  [<ffffffffa0061ecb>] fc_rport_final_delete+0xdf/0x146
[scsi_transport_fc]
[  359.946240]  [<ffffffffa0061dec>] ? fc_rport_final_delete+0x0/0x146
[scsi_transport_fc]
[  359.946248]  [<ffffffff81052918>] run_workqueue+0x112/0x20c
[  359.946256]  [<ffffffff810528c7>] ? run_workqueue+0xc1/0x20c
[  359.946265]  [<ffffffff81052afc>] worker_thread+0xea/0xfb
[  359.946275]  [<ffffffff810570d3>] ? autoremove_wake_function+0x0/0x34
[  359.946286]  [<ffffffff81052a12>] ? worker_thread+0x0/0xfb
[  359.946297]  [<ffffffff81056d92>] kthread+0x85/0x8d
[  359.946308]  [<ffffffff8100cbfa>] child_rip+0xa/0x20
[  359.946321]  [<ffffffff8100c5bc>] ? restore_args+0x0/0x30
[  359.946327]  [<ffffffff81056cec>] ? kthreadd+0x107/0x128
[  359.946334]  [<ffffffff81056d0d>] ? kthread+0x0/0x8d
[  359.946340]  [<ffffffff8100cbf0>] ? child_rip+0x0/0x20
[  359.946345] 3 locks held by fc_wq_6/7437:
[  359.946349]  #0:  ((fc_host->work_q_name)){+.+.+.}, at:
[<ffffffff810528c7>] run_workqueue+0xc1/0x20c
[  359.946366]  #1:  (&rport->rport_delete_work){+.+.+.}, at:
[<ffffffff810528c7>] run_workqueue+0xc1/0x20c
[  359.946381]  #2:  (&shost->scan_mutex){+.+.+.}, at:
[<ffffffff812a97b3>] scsi_remove_device+0x1e/0x33
[  359.946398] INFO: task async/4:7484 blocked for more than 120
seconds.
[  359.946402] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  359.946407] async/4       D 0000000000000002  5080  7484      2
0x00000080
[  359.946418]  ffff880339c019a0 0000000000000046 ffff88033dcbaa80
ffff88033adb6b40
[  359.946432]  000000000000005b 0000000000011c00 0000000000011c00
ffff88033dcbae18
[  359.946447]  0000000000011c00 0000000000011c00 0000000000011c00
ffff88033dcbae18
[  359.946460] Call Trace:
[  359.946470]  [<ffffffff814f6e4d>] schedule_timeout+0x22/0xc5
[  359.946477]  [<ffffffff814f844b>] ? _spin_unlock_irq+0x2b/0x31
[  359.946485]  [<ffffffff81066484>] ? trace_hardirqs_on_caller
+0x10c/0x130
[  359.946494]  [<ffffffff810664b5>] ? trace_hardirqs_on+0xd/0xf
[  359.946503]  [<ffffffff814f844b>] ? _spin_unlock_irq+0x2b/0x31
[  359.946512]  [<ffffffff814f6d13>] wait_for_common+0xc1/0x120
[  359.946524]  [<ffffffff8103b43a>] ? default_wake_function+0x0/0xf
[  359.946537]  [<ffffffff814f6dfc>] wait_for_completion+0x18/0x1a
[  359.946550]  [<ffffffff811c71a8>] blk_execute_rq+0xaf/0xcf
[  359.946562]  [<ffffffff810d11bc>] ? __kmalloc+0x1b1/0x1c3
[  359.946576]  [<ffffffff810f9107>] ? bio_phys_segments+0x17/0x1f
[  359.946587]  [<ffffffff812a618e>] scsi_execute+0xf1/0x143
[  359.946593]  [<ffffffff812a6260>] scsi_execute_req+0x80/0xb2
[  359.946602]  [<ffffffffa000180a>] read_capacity_10+0x83/0x1a9
[sd_mod]
[  359.946611]  [<ffffffffa000286e>] sd_read_capacity+0x58b/0x5b1
[sd_mod]
[  359.946619]  [<ffffffff81060000>] ? clocksource_change_rating
+0x3c/0x99
[  359.946627]  [<ffffffffa0002dcc>] sd_revalidate_disk+0x111/0x1d5
[sd_mod]
[  359.946635]  [<ffffffffa0002f5e>] sd_probe_async+0xce/0x1d4 [sd_mod]
[  359.946642]  [<ffffffff810664b5>] ? trace_hardirqs_on+0xd/0xf
[  359.946647]  [<ffffffff8105c996>] ? async_thread+0x0/0xb6
[  359.946653]  [<ffffffff8105c8ea>] run_one_entry+0xa4/0x150
[  359.946659]  [<ffffffff8105c9fc>] async_thread+0x66/0xb6
[  359.946666]  [<ffffffff8103b43a>] ? default_wake_function+0x0/0xf
[  359.946672]  [<ffffffff8105c996>] ? async_thread+0x0/0xb6


Earlier I got the same lockdep warning as in your second post:

[72135.775948] =======================================================
[72135.776363] [ INFO: possible circular locking dependency detected ]
[72135.776574] 2.6.31-rc4-upstream #49
[72135.776780] -------------------------------------------------------
[72135.776990] dd/14168 is trying to acquire lock:
[72135.777197]  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff81042d96>]
cpu_maps_update_begin+0x12/0x14
[72135.777621]
[72135.777621] but task is already holding lock:
[72135.778025]  (sd_ref_mutex){+.+.+.}, at: [<ffffffffa0001058>]
scsi_disk_put+0x1c/0x3e [sd_mod]
[72135.778446]
[72135.778447] which lock already depends on the new lock.
[72135.778447]
[72135.779056]
[72135.779056] the existing dependency chain (in reverse order) is:
[72135.779469]
[72135.779469] -> #4 (sd_ref_mutex){+.+.+.}:
[72135.779883]        [<ffffffff81067002>] check_prev_add+0x146/0x1c7
[72135.780100]        [<ffffffff810674c3>] validate_chain+0x440/0x4da
[72135.780314]        [<ffffffff8106784f>] __lock_acquire+0x2f2/0x363
[72135.780529]        [<ffffffff8106798a>] lock_acquire+0xca/0xee
[72135.780739]        [<ffffffff814f73c7>] __mutex_lock_common
+0x47/0x331
[72135.780953]        [<ffffffff814f7760>] mutex_lock_nested+0x37/0x3c
[72135.781166]        [<ffffffffa0001009>] scsi_disk_get_from_dev
+0x1a/0x4d [sd_mod]
[72135.781578]        [<ffffffffa000150c>] sd_shutdown+0x12/0x116
[sd_mod]
[72135.781793]        [<ffffffffa000174f>] sd_remove+0x53/0x8b [sd_mod]
[72135.782007]        [<ffffffff81295d1e>] __device_release_driver
+0x67/0xb0
[72135.782226]        [<ffffffff81295e35>] device_release_driver
+0x1e/0x2b
[72135.782441]        [<ffffffff812951cb>] bus_remove_device+0xa8/0xbd
[72135.782655]        [<ffffffff81293ae0>] device_del+0x132/0x181
[72135.782868]        [<ffffffff812a9758>] __scsi_remove_device
+0x44/0x81
[72135.783084]        [<ffffffff812a97bb>] scsi_remove_device+0x26/0x33
[72135.783299]        [<ffffffff812a9871>] __scsi_remove_target
+0x97/0xd5
[72135.783513]        [<ffffffff812a9914>] __remove_child+0x1e/0x25
[72135.783724]        [<ffffffff8129343f>] device_for_each_child
+0x35/0x6c
[72135.783936]        [<ffffffff812a98e9>] scsi_remove_target+0x3a/0x47
[72135.784149]        [<ffffffffa0061de8>] fc_starget_delete+0x21/0x25
[scsi_transport_fc]
[72135.784564]        [<ffffffffa0061ecb>] fc_rport_final_delete
+0xdf/0x146 [scsi_transport_fc]
[72135.784980]        [<ffffffff81052918>] run_workqueue+0x112/0x20c
[72135.785192]        [<ffffffff81052afc>] worker_thread+0xea/0xfb
[72135.785402]        [<ffffffff81056d92>] kthread+0x85/0x8d
[72135.785614]        [<ffffffff8100cbfa>] child_rip+0xa/0x20
[72135.785825]        [<ffffffffffffffff>] 0xffffffffffffffff
[72135.786036]
[72135.786036] -> #3 (&shost->scan_mutex){+.+.+.}:
[72135.786447]        [<ffffffff81067002>] check_prev_add+0x146/0x1c7
[72135.786658]        [<ffffffff810674c3>] validate_chain+0x440/0x4da
[72135.786870]        [<ffffffff8106784f>] __lock_acquire+0x2f2/0x363
[72135.787081]        [<ffffffff8106798a>] lock_acquire+0xca/0xee
[72135.787292]        [<ffffffff814f73c7>] __mutex_lock_common
+0x47/0x331
[72135.787503]        [<ffffffff814f7760>] mutex_lock_nested+0x37/0x3c
[72135.787715]        [<ffffffff812a97b3>] scsi_remove_device+0x1e/0x33
[72135.787927]        [<ffffffff812a9871>] __scsi_remove_target
+0x97/0xd5
[72135.788141]        [<ffffffff812a9914>] __remove_child+0x1e/0x25
[72135.788353]        [<ffffffff8129343f>] device_for_each_child
+0x35/0x6c
[72135.788567]        [<ffffffff812a98e9>] scsi_remove_target+0x3a/0x47
[72135.788780]        [<ffffffffa0061de8>] fc_starget_delete+0x21/0x25
[scsi_transport_fc]
[72135.789192]        [<ffffffffa0061ecb>] fc_rport_final_delete
+0xdf/0x146 [scsi_transport_fc]
[72135.789608]        [<ffffffff81052918>] run_workqueue+0x112/0x20c
[72135.789822]        [<ffffffff81052afc>] worker_thread+0xea/0xfb
[72135.790034]        [<ffffffff81056d92>] kthread+0x85/0x8d
[72135.790245]        [<ffffffff8100cbfa>] child_rip+0xa/0x20
[72135.790456]        [<ffffffffffffffff>] 0xffffffffffffffff
[72135.790666]
[72135.790666] -> #2 (&rport->rport_delete_work){+.+.+.}:
[72135.791081]        [<ffffffff81067002>] check_prev_add+0x146/0x1c7
[72135.791293]        [<ffffffff810674c3>] validate_chain+0x440/0x4da
[72135.791506]        [<ffffffff8106784f>] __lock_acquire+0x2f2/0x363
[72135.791721]        [<ffffffff8106798a>] lock_acquire+0xca/0xee
[72135.791932]        [<ffffffff81052912>] run_workqueue+0x10c/0x20c
[72135.792145]        [<ffffffff81052afc>] worker_thread+0xea/0xfb
[72135.792355]        [<ffffffff81056d92>] kthread+0x85/0x8d
[72135.792565]        [<ffffffff8100cbfa>] child_rip+0xa/0x20
[72135.792775]        [<ffffffffffffffff>] 0xffffffffffffffff
[72135.792985]
[72135.792986] -> #1 ((fc_host->work_q_name)){+.+.+.}:
[72135.793402]        [<ffffffff81067002>] check_prev_add+0x146/0x1c7
[72135.793614]        [<ffffffff810674c3>] validate_chain+0x440/0x4da
[72135.793827]        [<ffffffff8106784f>] __lock_acquire+0x2f2/0x363
[72135.794039]        [<ffffffff8106798a>] lock_acquire+0xca/0xee
[72135.794251]        [<ffffffff8105302b>] cleanup_workqueue_thread
+0x45/0xb9
[72135.794464]        [<ffffffff81053165>] destroy_workqueue+0x78/0xab
[72135.794677]        [<ffffffffa00625fd>] fc_remove_host+0x179/0x1a1
[scsi_transport_fc]
[72135.795089]        [<ffffffffa0090f61>] 0xffffffffa0090f61
[72135.795299]        [<ffffffffa0090fe4>] 0xffffffffa0090fe4
[72135.795509]        [<ffffffff81055171>] param_attr_store+0x22/0x30
[72135.795721]        [<ffffffff810551c4>] module_attr_store+0x21/0x25
[72135.795932]        [<ffffffff811284a5>] sysfs_write_file+0xf6/0x132
[72135.796145]        [<ffffffff810d6a57>] vfs_write+0xa9/0x106
[72135.796359]        [<ffffffff810d6b6a>] sys_write+0x45/0x69
[72135.796569]        [<ffffffff8100baeb>] system_call_fastpath
+0x16/0x1b
[72135.796783]        [<ffffffffffffffff>] 0xffffffffffffffff
[72135.796998]
[72135.796998] -> #0 (cpu_add_remove_lock){+.+.+.}:
[72135.797413]        [<ffffffff81066f02>] check_prev_add+0x46/0x1c7
[72135.797626]        [<ffffffff810674c3>] validate_chain+0x440/0x4da
[72135.797837]        [<ffffffff8106784f>] __lock_acquire+0x2f2/0x363
[72135.798051]        [<ffffffff8106798a>] lock_acquire+0xca/0xee
[72135.798264]        [<ffffffff814f73c7>] __mutex_lock_common
+0x47/0x331
[72135.798477]        [<ffffffff814f7760>] mutex_lock_nested+0x37/0x3c
[72135.798689]        [<ffffffff81042d96>] cpu_maps_update_begin
+0x12/0x14
[72135.798902]        [<ffffffff81053119>] destroy_workqueue+0x2c/0xab
[72135.799114]        [<ffffffff812a09df>] scsi_host_dev_release
+0x56/0xb9
[72135.799329]        [<ffffffff81293360>] device_release+0x49/0x76
[72135.799541]        [<ffffffff811d3d83>] kobject_release+0x51/0x66
[72135.799758]        [<ffffffff811d4b05>] kref_put+0x43/0x4d
[72135.799969]        [<ffffffff811d3c8c>] kobject_put+0x47/0x4b
[72135.800181]        [<ffffffff81292c92>] put_device+0x12/0x14
[72135.800392]        [<ffffffffa006137f>] fc_rport_dev_release
+0x14/0x21 [scsi_transport_fc]
[72135.800802]        [<ffffffff81293360>] device_release+0x49/0x76
[72135.801012]        [<ffffffff811d3d83>] kobject_release+0x51/0x66
[72135.801223]        [<ffffffff811d4b05>] kref_put+0x43/0x4d
[72135.801432]        [<ffffffff811d3c8c>] kobject_put+0x47/0x4b
[72135.801642]        [<ffffffff81292c92>] put_device+0x12/0x14
[72135.801852]        [<ffffffff812a6922>] scsi_target_dev_release
+0x1d/0x21
[72135.802067]        [<ffffffff81293360>] device_release+0x49/0x76
[72135.802280]        [<ffffffff811d3d83>] kobject_release+0x51/0x66
[72135.802491]        [<ffffffff811d4b05>] kref_put+0x43/0x4d
[72135.802703]        [<ffffffff811d3c8c>] kobject_put+0x47/0x4b
[72135.802913]        [<ffffffff81292c92>] put_device+0x12/0x14
[72135.803122]        [<ffffffff812a93fb>]
scsi_device_dev_release_usercontext+0x189/0x195
[72135.803532]        [<ffffffff81053aad>] execute_in_process_context
+0x2a/0x70
[72135.803746]        [<ffffffff812a9270>] scsi_device_dev_release
+0x17/0x19
[72135.803958]        [<ffffffff81293360>] device_release+0x49/0x76
[72135.809046]        [<ffffffff811d3d83>] kobject_release+0x51/0x66
[72135.809259]        [<ffffffff811d4b05>] kref_put+0x43/0x4d
[72135.809470]        [<ffffffff811d3c8c>] kobject_put+0x47/0x4b
[72135.809682]        [<ffffffff81292c92>] put_device+0x12/0x14
[72135.809893]        [<ffffffff8129faad>] scsi_device_put+0x3d/0x42
[72135.810105]        [<ffffffffa0001069>] scsi_disk_put+0x2d/0x3e
[sd_mod]
[72135.810319]        [<ffffffffa0001b13>] sd_release+0x4d/0x54 [sd_mod]
[72135.810532]        [<ffffffff810fadb3>] __blkdev_put+0x80/0x138
[72135.810746]        [<ffffffff810fae76>] blkdev_put+0xb/0xd
[72135.810957]        [<ffffffff810faeb1>] blkdev_close+0x39/0x3e
[72135.811170]        [<ffffffff810d746f>] __fput+0x123/0x1d0
[72135.811381]        [<ffffffff810d7531>] fput+0x15/0x17
[72135.811590]        [<ffffffff810d46b5>] filp_close+0x63/0x6d
[72135.811801]        [<ffffffff810d4767>] sys_close+0xa8/0xe2
[72135.812011]        [<ffffffff8100baeb>] system_call_fastpath
+0x16/0x1b
[72135.812225]        [<ffffffffffffffff>] 0xffffffffffffffff
[72135.812435]
[72135.812435] other info that might help us debug this:
[72135.812436]
[72135.813044] 2 locks held by dd/14168:
[72135.813249]  #0:  (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff810fad66>]
__blkdev_put+0x33/0x138
[72135.813668]  #1:  (sd_ref_mutex){+.+.+.}, at: [<ffffffffa0001058>]
scsi_disk_put+0x1c/0x3e [sd_mod]
[72135.814088]



> >    So Joe, could you please test this fix in your setup with your
> > rport deletion fixes applied?
> 
> Yes.  I'll do my tests with this.
> 

Thanks.
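
One more thought while you test: if flushing the whole system
workqueue proves too heavy-handed (it also waits on unrelated
subsystems' work), an alternative worth exploring is a private libfc
workqueue for exchange timeout work, so teardown flushes only our own
items. A rough sketch only, untested, reusing current function names:

static struct workqueue_struct *fc_exch_wq; /* create_workqueue() at module init */

static inline void fc_exch_timer_set_locked(struct fc_exch *ep,
					    unsigned int timer_msec)
{
	/* queue on the private queue instead of schedule_delayed_work() */
	if (queue_delayed_work(fc_exch_wq, &ep->timeout_work,
			       msecs_to_jiffies(timer_msec)))
		fc_exch_hold(ep);	/* hold for timer */
}

void fc_exch_mgr_free(struct fc_lport *lport)
{
	struct fc_exch_mgr_anchor *ema, *next;

	flush_workqueue(fc_exch_wq);	/* waits only on libfc's work */
	list_for_each_entry_safe(ema, next, &lport->ema_list, ema_list)
		fc_exch_mgr_del(ema);
}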

> 
> > 
> > Signed-off-by: Vasu Dev <[email protected]>
> > ---
> > 
> >  drivers/scsi/libfc/fc_exch.c |    7 +------
> >  1 files changed, 1 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/scsi/libfc/fc_exch.c b/drivers/scsi/libfc/fc_exch.c
> > index b51db15..9c754d5 100644
> > --- a/drivers/scsi/libfc/fc_exch.c
> > +++ b/drivers/scsi/libfc/fc_exch.c
> > @@ -1446,12 +1446,6 @@ static void fc_exch_reset(struct fc_exch *ep)
> >  
> >     spin_lock_bh(&ep->ex_lock);
> >     ep->state |= FC_EX_RST_CLEANUP;
> > -   /*
> > -    * we really want to call del_timer_sync, but cannot due
> > -    * to the lport calling with the lport lock held (some resp
> > -    * functions can also grab the lport lock which could cause
> > -    * a deadlock).
> > -    */
> >     if (cancel_delayed_work(&ep->timeout_work))
> >             atomic_dec(&ep->ex_refcnt);     /* drop hold for timer */
> >     resp = ep->resp;
> > @@ -1898,6 +1892,7 @@ void fc_exch_mgr_free(struct fc_lport *lport)
> >  {
> >     struct fc_exch_mgr_anchor *ema, *next;
> >  
> > +   flush_scheduled_work();
> >     list_for_each_entry_safe(ema, next, &lport->ema_list, ema_list)
> >             fc_exch_mgr_del(ema);
> >  }
> > 

_______________________________________________
devel mailing list
[email protected]
http://www.open-fcoe.org/mailman/listinfo/devel
