I observed a system hang when discovery fails due to a lack of XID
resources. This can happen when the number of NPIV ports created is
greater than the number of XIDs allocated per pool (e.g., creating 255
NPIV ports on a system with nr_cpu_ids of 32, where each pool contains
128 XIDs). Generating a link event (shutdown/no shutdown on the switch
port) then causes the hang. Attached is the stack trace of the threads
in the blocked state.

From the stack trace:

1. The disc_work scheduled as part of the 3rd retry attempt runs
   fc_disc_timeout, which calls fc_disc_done with DISC_EV_FAILED.
2. fc_lport_disc_callback is called to complete discovery with a failure,
   and it calls fc_lport_enter_reset.
3. While holding lp_mutex, fc_lport_reset_locked calls fc_disc_stop.
4. fc_disc_stop tries to cancel disc_work from within that same work item
   by calling cancel_delayed_work_sync(). (Is it trying to cancel itself?)
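The suspected self-cancel can be illustrated with a minimal userspace
model. This is a hypothetical sketch, not libfc or kernel code: struct
fake_work, fake_cancel_work_sync(), fake_disc_work_fn() and
run_self_cancel_demo() are made-up names, and a POSIX thread stands in
for the events/0 worker. The model assumes that cancel_delayed_work_sync()
waits for a currently running instance of the work to finish; if the
caller is that running instance, the wait can never complete. A timeout
is added only so the demo terminates.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical model of a work item; this is NOT the kernel workqueue API. */
struct fake_work {
    pthread_mutex_t lock;
    pthread_cond_t  done;
    bool            running;   /* true while the work function executes */
    int             cancel_rc; /* result of the self-cancel attempt */
};

/*
 * Models cancel_delayed_work_sync(): wait until the work function has
 * finished. The timeout exists only so this demo terminates; the real
 * call has no timeout and would wait forever.
 */
static int fake_cancel_work_sync(struct fake_work *w, int timeout_ms)
{
    struct timespec deadline;
    int rc = 0;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&w->lock);
    while (w->running && rc == 0)
        rc = pthread_cond_timedwait(&w->done, &w->lock, &deadline);
    pthread_mutex_unlock(&w->lock);
    return rc; /* ETIMEDOUT: we were waiting on our own completion */
}

/* Stands in for fc_disc_timeout -> fc_disc_done -> ... -> fc_disc_stop. */
static void *fake_disc_work_fn(void *arg)
{
    struct fake_work *w = arg;

    pthread_mutex_lock(&w->lock);
    w->running = true;
    pthread_mutex_unlock(&w->lock);

    /* The work item synchronously cancels itself while still running. */
    w->cancel_rc = fake_cancel_work_sync(w, 200);

    pthread_mutex_lock(&w->lock);
    w->running = false;
    pthread_cond_broadcast(&w->done);
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

/* Runs the work on its own thread, as a workqueue worker would. */
static int run_self_cancel_demo(void)
{
    struct fake_work w;
    pthread_t worker;

    pthread_mutex_init(&w.lock, NULL);
    pthread_cond_init(&w.done, NULL);
    w.running = false;
    w.cancel_rc = 0;

    if (pthread_create(&worker, NULL, fake_disc_work_fn, &w) != 0)
        w.cancel_rc = -1;
    else
        pthread_join(worker, NULL);

    pthread_cond_destroy(&w.done);
    pthread_mutex_destroy(&w.lock);
    return w.cancel_rc;
}
```

In this model run_self_cancel_demo() returns ETIMEDOUT after roughly
200 ms, because the work function spends its entire wait blocked on its
own completion. Without the timeout it would block indefinitely, which is
consistent with events/0 sitting in wait_for_common under
__cancel_work_timer in the trace.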

Please let me know your thoughts on this.

Thanks,
Bhanu


Jul 18 17:54:42 R710-41-67 kernel: [40940.244008] SysRq : Show Blocked State
Jul 18 17:54:42 R710-41-67 kernel: [40940.247756]   task                        PC stack   pid father
Jul 18 17:54:42 R710-41-67 kernel: [40940.247761] events/0      D 00000000ffffffff     0    35      2 0x00000000
Jul 18 17:54:42 R710-41-67 kernel: [40940.247766]  ffff880328b89bd0 0000000000000046 0000666666666663 ffff880328b86400
Jul 18 17:54:42 R710-41-67 kernel: [40940.247770]  0000000000013600 ffff880328b89fd8 0000000000013600 ffff880328b89fd8
Jul 18 17:54:42 R710-41-67 kernel: [40940.247773]  0000000000013600 0000000000013600 0000000000013600 0000000000013600
Jul 18 17:54:42 R710-41-67 kernel: [40940.247777] Call Trace:
Jul 18 17:54:42 R710-41-67 kernel: [40940.247791]  [<ffffffff81393d3d>] schedule_timeout+0x19d/0x230
Jul 18 17:54:42 R710-41-67 kernel: [40940.247799]  [<ffffffff81392fb0>] wait_for_common+0xc0/0x170
Jul 18 17:54:42 R710-41-67 kernel: [40940.247806]  [<ffffffff810601af>] __cancel_work_timer+0xcf/0x1b0
Jul 18 17:54:42 R710-41-67 kernel: [40940.247817]  [<ffffffffa047a386>] fc_disc_stop+0x16/0x30 [libfc2]
Jul 18 17:54:42 R710-41-67 kernel: [40940.247831]  [<ffffffffa047f8e7>] fc_lport_reset_locked+0x47/0x90 [libfc2]
Jul 18 17:54:42 R710-41-67 kernel: [40940.247843]  [<ffffffffa0480397>] fc_lport_enter_reset+0x67/0xe0 [libfc2]
Jul 18 17:54:42 R710-41-67 kernel: [40940.247854]  [<ffffffffa048110c>] fc_lport_disc_callback+0xbc/0xe0 [libfc2]
Jul 18 17:54:42 R710-41-67 kernel: [40940.247865]  [<ffffffffa047a5a8>] fc_disc_done+0xa8/0xf0 [libfc2]
Jul 18 17:54:42 R710-41-67 kernel: [40940.247872]  [<ffffffffa047a489>] fc_disc_timeout+0x29/0x40 [libfc2]
Jul 18 17:54:42 R710-41-67 kernel: [40940.247878]  [<ffffffff8105f558>] run_workqueue+0xb8/0x140
Jul 18 17:54:42 R710-41-67 kernel: [40940.247883]  [<ffffffff8105f676>] worker_thread+0x96/0x110
Jul 18 17:54:42 R710-41-67 kernel: [40940.247890]  [<ffffffff810636f6>] kthread+0x96/0xa0
Jul 18 17:54:42 R710-41-67 kernel: [40940.247895]  [<ffffffff81003fba>] child_rip+0xa/0x20
Jul 18 17:54:42 R710-41-67 kernel: [40940.247900] events/2      D 00000000ffffffff     0    37      2 0x00000000
Jul 18 17:54:42 R710-41-67 kernel: [40940.247904]  ffff880328b91c48 0000000000000046 ffffffff811db3e5 ffff880328b8e480
Jul 18 17:54:42 R710-41-67 kernel: [40940.247907]  0000000000013600 ffff880328b91fd8 0000000000013600 ffff880328b91fd8
Jul 18 17:54:42 R710-41-67 kernel: [40940.247910]  0000000000013600 0000000000013600 0000000000013600 0000000000013600
Jul 18 17:54:42 R710-41-67 kernel: [40940.247914] Call Trace:
Jul 18 17:54:42 R710-41-67 kernel: [40940.247919]  [<ffffffff81394657>] __mutex_lock_slowpath+0xe7/0x170
Jul 18 17:54:42 R710-41-67 kernel: [40940.247923]  [<ffffffff813940aa>] mutex_lock+0x1a/0x40
Jul 18 17:54:42 R710-41-67 kernel: [40940.247931]  [<ffffffffa0488b58>] fc_vport_id_lookup2+0x28/0x90 [libfc2]
Jul 18 17:54:42 R710-41-67 kernel: [40940.247946]  [<ffffffffa049f469>] fcoe_ctlr_recv_work+0xe19/0x1188 [libfcoe2]
Jul 18 17:54:42 R710-41-67 kernel: [40940.247952]  [<ffffffff8105f558>] run_workqueue+0xb8/0x140
Jul 18 17:54:42 R710-41-67 kernel: [40940.247957]  [<ffffffff8105f676>] worker_thread+0x96/0x110
Jul 18 17:54:42 R710-41-67 kernel: [40940.247962]  [<ffffffff810636f6>] kthread+0x96/0xa0
Jul 18 17:54:42 R710-41-67 kernel: [40940.247966]  [<ffffffff81003fba>] child_rip+0xa/0x20
_______________________________________________
devel mailing list
[email protected]
http://www.open-fcoe.org/mailman/listinfo/devel