I observed a system hang when discovery fails due to a lack of xid
resources. This can happen when the number of NPIV ports created is
greater than the number of xids allocated per pool. For example, create
255 NPIV ports on a system with nr_cpu_ids of 32, where each pool
contains 128 xids, and then generate a link event (shutdown/no shutdown
on the switch port); this causes the hang. The stack trace of the
threads in blocked state is included below.
From the stack trace:
1. disc_work, scheduled as part of the 3rd retry attempt, calls
fc_disc_done with DISC_EV_FAILED.
2. disc_callback is called to complete the discovery with a failure.
3. While holding lp_mutex, fc_lport_reset_locked calls disc_stop.
4. disc_stop tries to cancel disc_work from within that same work item
by calling cancel_delayed_work_sync(). (Is it trying to cancel itself?)
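To illustrate the suspicion in step 4, here is a minimal userspace sketch of a work handler that synchronously cancels its own work item. It is not libfc or kernel workqueue code: every model_* name is hypothetical, the busy-wait only stands in for the kernel's sleeping wait, and the -EDEADLK return is an assumption added so the model reports the problem instead of hanging (the real cancel_delayed_work_sync() has no such return and simply waits).

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a work item; not the kernel's struct. */
struct model_work {
    atomic_bool running;               /* handler is currently executing */
    void (*fn)(struct model_work *w);  /* handler to run */
    int cancel_rc;                     /* result the handler saw when cancelling */
};

/* Work item the calling thread is executing right now, if any. */
static _Thread_local struct model_work *model_current;

/*
 * Model of a synchronous cancel: wait until the handler has finished.
 * If the caller IS the handler, that wait could never end, so the model
 * reports -EDEADLK instead of blocking (assumption for illustration).
 */
static int model_cancel_sync(struct model_work *w)
{
    if (model_current == w)
        return -EDEADLK;
    while (atomic_load(&w->running))
        ;  /* busy-wait stands in for the kernel's sleeping wait */
    return 0;
}

/* Handler that, like the failed-discovery path, ends up cancelling itself. */
static void model_disc_timeout(struct model_work *w)
{
    w->cancel_rc = model_cancel_sync(w);
}

/* Model of the worker thread running one work item. */
static void model_run_work(struct model_work *w)
{
    atomic_store(&w->running, true);
    model_current = w;
    w->fn(w);
    model_current = NULL;
    atomic_store(&w->running, false);
}

/* Runs the model once and returns what the self-cancel attempt reported. */
static int model_self_cancel_result(void)
{
    struct model_work w = { .fn = model_disc_timeout, .cancel_rc = 0 };

    model_run_work(&w);
    return w.cancel_rc;
}
```

Calling model_self_cancel_result() returns -EDEADLK in this model, while model_cancel_sync() on an idle work item returns 0. If the real path behaves the same way, the events/0 thread in the trace would be waiting for a completion that only it can signal.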
Please let me know your thoughts on this.
Thanks,
Bhanu
Jul 18 17:54:42 R710-41-67 kernel: [40940.244008] SysRq : Show Blocked State
Jul 18 17:54:42 R710-41-67 kernel: [40940.247756] task PC stack pid father
Jul 18 17:54:42 R710-41-67 kernel: [40940.247761] events/0 D 00000000ffffffff 0 35 2 0x00000000
Jul 18 17:54:42 R710-41-67 kernel: [40940.247766] ffff880328b89bd0 0000000000000046 0000666666666663 ffff880328b86400
Jul 18 17:54:42 R710-41-67 kernel: [40940.247770] 0000000000013600 ffff880328b89fd8 0000000000013600 ffff880328b89fd8
Jul 18 17:54:42 R710-41-67 kernel: [40940.247773] 0000000000013600 0000000000013600 0000000000013600 0000000000013600
Jul 18 17:54:42 R710-41-67 kernel: [40940.247777] Call Trace:
Jul 18 17:54:42 R710-41-67 kernel: [40940.247791] [<ffffffff81393d3d>] schedule_timeout+0x19d/0x230
Jul 18 17:54:42 R710-41-67 kernel: [40940.247799] [<ffffffff81392fb0>] wait_for_common+0xc0/0x170
Jul 18 17:54:42 R710-41-67 kernel: [40940.247806] [<ffffffff810601af>] __cancel_work_timer+0xcf/0x1b0
Jul 18 17:54:42 R710-41-67 kernel: [40940.247817] [<ffffffffa047a386>] fc_disc_stop+0x16/0x30 [libfc2]
Jul 18 17:54:42 R710-41-67 kernel: [40940.247831] [<ffffffffa047f8e7>] fc_lport_reset_locked+0x47/0x90 [libfc2]
Jul 18 17:54:42 R710-41-67 kernel: [40940.247843] [<ffffffffa0480397>] fc_lport_enter_reset+0x67/0xe0 [libfc2]
Jul 18 17:54:42 R710-41-67 kernel: [40940.247854] [<ffffffffa048110c>] fc_lport_disc_callback+0xbc/0xe0 [libfc2]
Jul 18 17:54:42 R710-41-67 kernel: [40940.247865] [<ffffffffa047a5a8>] fc_disc_done+0xa8/0xf0 [libfc2]
Jul 18 17:54:42 R710-41-67 kernel: [40940.247872] [<ffffffffa047a489>] fc_disc_timeout+0x29/0x40 [libfc2]
Jul 18 17:54:42 R710-41-67 kernel: [40940.247878] [<ffffffff8105f558>] run_workqueue+0xb8/0x140
Jul 18 17:54:42 R710-41-67 kernel: [40940.247883] [<ffffffff8105f676>] worker_thread+0x96/0x110
Jul 18 17:54:42 R710-41-67 kernel: [40940.247890] [<ffffffff810636f6>] kthread+0x96/0xa0
Jul 18 17:54:42 R710-41-67 kernel: [40940.247895] [<ffffffff81003fba>] child_rip+0xa/0x20
Jul 18 17:54:42 R710-41-67 kernel: [40940.247900] events/2 D 00000000ffffffff 0 37 2 0x00000000
Jul 18 17:54:42 R710-41-67 kernel: [40940.247904] ffff880328b91c48 0000000000000046 ffffffff811db3e5 ffff880328b8e480
Jul 18 17:54:42 R710-41-67 kernel: [40940.247907] 0000000000013600 ffff880328b91fd8 0000000000013600 ffff880328b91fd8
Jul 18 17:54:42 R710-41-67 kernel: [40940.247910] 0000000000013600 0000000000013600 0000000000013600 0000000000013600
Jul 18 17:54:42 R710-41-67 kernel: [40940.247914] Call Trace:
Jul 18 17:54:42 R710-41-67 kernel: [40940.247919] [<ffffffff81394657>] __mutex_lock_slowpath+0xe7/0x170
Jul 18 17:54:42 R710-41-67 kernel: [40940.247923] [<ffffffff813940aa>] mutex_lock+0x1a/0x40
Jul 18 17:54:42 R710-41-67 kernel: [40940.247931] [<ffffffffa0488b58>] fc_vport_id_lookup2+0x28/0x90 [libfc2]
Jul 18 17:54:42 R710-41-67 kernel: [40940.247946] [<ffffffffa049f469>] fcoe_ctlr_recv_work+0xe19/0x1188 [libfcoe2]
Jul 18 17:54:42 R710-41-67 kernel: [40940.247952] [<ffffffff8105f558>] run_workqueue+0xb8/0x140
Jul 18 17:54:42 R710-41-67 kernel: [40940.247957] [<ffffffff8105f676>] worker_thread+0x96/0x110
Jul 18 17:54:42 R710-41-67 kernel: [40940.247962] [<ffffffff810636f6>] kthread+0x96/0xa0
Jul 18 17:54:42 R710-41-67 kernel: [40940.247966] [<ffffffff81003fba>] child_rip+0xa/0x20
_______________________________________________
devel mailing list
[email protected]
http://www.open-fcoe.org/mailman/listinfo/devel