Hi Mike,
Regarding the analysis of waiting-tasks.txt, I'd like to add some information
about the issue:
After running the "rescan" that gets stuck, any other iscsiadm command we try
running - gets stuck too. I'm guessing that this is because the first command
locks some mutex that all other commands wait on. (I didn't verify this, it's
mainly a guess)
I'm guessing that the command you analyzed in the previous mail is not the
original "rescan" but something else that was run later, or in other words, I
think it's not the root cause.
Here are 2 example stack traces (from 2 different runs) that I believe belong
to the stuck iscsiadm process:
{
iscsiadm R running task 0 4597 1 0x00000004
ffff88009bab38b8 ffffffff815ec84c ffff88009bab37d8 ffff88009bab2010
0000000000004000 ffff88009ba1d5c0 ffff88009bab3fd8 ffff88009bab3fd8
0000000000000000 ffff88009ba1d5c0 ffff88013b088c40 ffff88009ba1d858
Call Trace:
[<ffffffff815ec84c>] ? __schedule+0x3fc/0x8d0
[<ffffffff8103c765>] ? wake_up_process+0x15/0x20
[<ffffffff810488b5>] ? raise_softirq_irqoff+0x95/0xb0
[<ffffffff81279536>] ? number+0x316/0x340
[<ffffffff8126df97>] ? cfq_set_request+0x257/0x470
[<ffffffffa0028002>] ? iscsi_queuecommand+0x102/0x4f0 [libiscsi]
[<ffffffff81051c4b>] ? lock_timer_base+0x3b/0x70
[<ffffffff810e1d3d>] free_percpu+0x9d/0x130
[<ffffffff810e22b9>] ? pcpu_alloc+0x1e9/0x900
[<ffffffff81273bda>] ? kobject_release+0x4a/0xa0
[<ffffffff8127400d>] ? kobject_set_name_vargs+0x6d/0x80
[<ffffffff8112e0b6>] ? bio_phys_segments+0x26/0x30
[<ffffffff8125ca20>] ? blk_rq_bio_prep+0x30/0xc0
[<ffffffff8125f795>] ? blk_put_queue+0x15/0x20
[<ffffffff81310a8f>] ? scsi_device_dev_release_usercontext+0x17f/0x1d0
[<ffffffff8125f61c>] ? blk_alloc_queue_node+0x6c/0x1b0
[<ffffffff8125fcb4>] ? blk_init_queue_node+0x34/0x80
[<ffffffff812739b7>] ? kobject_init+0x37/0xa0
[<ffffffff8130cd26>] ? scsi_execute_req+0xa6/0x140
[<ffffffff8130e0f9>] ? scsi_probe_and_add_lun+0x289/0xce0
[<ffffffff810cbb5d>] ? free_one_page+0x20d/0x410
[<ffffffff815f01ce>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff81273bed>] ? kobject_release+0x5d/0xa0
[<ffffffff8130f168>] ? __scsi_scan_target+0x3d8/0x620
[<ffffffff81279d5b>] ? vsscanf+0x2ab/0x790
[<ffffffff8130fa28>] ? scsi_scan_target+0xc8/0xd0
[<ffffffffa00111b6>] ? iscsi_user_scan_session+0x156/0x1a0
[scsi_transport_iscsi]
[<ffffffffa0011060>] ? iscsi_host_for_each_session+0x20/0x20
[scsi_transport_iscsi]
[<ffffffff812dcc73>] ? device_for_each_child+0x53/0x80
[<ffffffffa001103d>] ? iscsi_user_scan+0x2d/0x30 [scsi_transport_iscsi]
[<ffffffff813105cc>] ? store_scan+0xec/0x120
[<ffffffff8110b568>] ? do_filp_open+0x48/0xa0
[<ffffffff812dbdc0>] ? dev_attr_store+0x20/0x30
[<ffffffff81162f31>] ? sysfs_write_file+0xd1/0x150
[<ffffffff810fcf3b>] ? vfs_write+0xcb/0x170
[<ffffffff810fd0e5>] ? sys_write+0x55/0x90
[<ffffffff815ef86b>] ? system_call_fastpath+0x16/0x1b
}
{
iscsiadm R running task 0 2845 1 0x00000004
ffff8800682078b8 ffffffff815ec84c ffff8800682077d8 ffff880068206010
0000000000004000 ffff8800680e8c40 ffff880068207fd8 ffff880068207fd8
0000000000000000 ffff8800680e8c40 ffff88013b088c40 ffff8800680e8ed8
Call Trace:
[<ffffffff815ec84c>] ? __schedule+0x3fc/0x8d0
[<ffffffff8103c765>] ? wake_up_process+0x15/0x20
[<ffffffff810488b5>] ? raise_softirq_irqoff+0x95/0xb0
[<ffffffff81279536>] ? number+0x316/0x340
[<ffffffff8126df97>] ? cfq_set_request+0x257/0x470
[<ffffffff8130c02d>] ? scsi_setup_blk_pc_cmnd+0x12d/0x160
[<ffffffff81304eb1>] ? scsi_dispatch_cmd+0xf1/0x210
[<ffffffff815ecdc5>] ? _cond_resched+0x35/0x50
[<ffffffff815ece1b>] ? wait_for_common+0x3b/0x180
[<ffffffff8128ae8a>] ? percpu_counter_destroy+0x1a/0x30
[<ffffffff812760c9>] ? prop_local_destroy_percpu+0x9/0x10
[<ffffffff810e22b9>] ? pcpu_alloc+0x1e9/0x900
[<ffffffff81273bda>] ? kobject_release+0x4a/0xa0
[<ffffffff81273fcd>] ? kobject_set_name_vargs+0x2d/0x80
[<ffffffff812dca69>] ? dev_set_name+0x69/0x70
[<ffffffff8125ca20>] ? blk_rq_bio_prep+0x30/0xc0
[<ffffffff8125f795>] ? blk_put_queue+0x15/0x20
[<ffffffff81310a8f>] ? scsi_device_dev_release_usercontext+0x17f/0x1d0
[<ffffffff8125f61c>] ? blk_alloc_queue_node+0x6c/0x1b0
[<ffffffff8125fcb4>] ? blk_init_queue_node+0x34/0x80
[<ffffffff8125fd13>] ? blk_init_queue+0x13/0x20
[<ffffffff812e3e55>] ? transport_setup_device+0x15/0x20
[<ffffffff8130cd26>] ? scsi_execute_req+0xa6/0x140
[<ffffffff8130e0f9>] ? scsi_probe_and_add_lun+0x289/0xce0
[<ffffffff810cbb5d>] ? free_one_page+0x20d/0x410
[<ffffffff815f01ce>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff81273bed>] ? kobject_release+0x5d/0xa0
[<ffffffff8130f168>] ? __scsi_scan_target+0x3d8/0x620
[<ffffffff81279d5b>] ? vsscanf+0x2ab/0x790
[<ffffffff8130fa28>] ? scsi_scan_target+0xc8/0xd0
[<ffffffffa00111b6>] ? iscsi_user_scan_session+0x156/0x1a0
[scsi_transport_iscsi]
[<ffffffffa0011060>] ? iscsi_host_for_each_session+0x20/0x20
[scsi_transport_iscsi]
[<ffffffff812dcc73>] ? device_for_each_child+0x53/0x80
[<ffffffffa001103d>] ? iscsi_user_scan+0x2d/0x30 [scsi_transport_iscsi]
[<ffffffff813105cc>] ? store_scan+0xec/0x120
[<ffffffff8110b568>] ? do_filp_open+0x48/0xa0
[<ffffffff812dbdc0>] ? dev_attr_store+0x20/0x30
[<ffffffff81162f31>] ? sysfs_write_file+0xd1/0x150
[<ffffffff810fcf3b>] ? vfs_write+0xcb/0x170
[<ffffffff810fd0e5>] ? sys_write+0x55/0x90
[<ffffffff815ef86b>] ? system_call_fastpath+0x16/0x1b
}
They have a lot of similarities but also differences. I provided 2 traces
hoping that it will be more informative for you.
They were obtained by running:
echo t > /proc/sysrq-trigger
dmesg > dmesg.txt
I know that it's the stuck process because I ran "killall iscsid" first, and
because "dmesg.txt" only contains one "iscsiadm" process.
Does this make sense?
-----Original Message-----
From: Cale, Yonatan
Sent: Wednesday, April 16, 2014 7:22 PM
To: 'Mike Christie'
Cc: [email protected]; [email protected]
Subject: RE: Target reboot -> iscsiadm rescan Stuck
Hi,
Attached is a compressed tcpdump capture (iscsiadm_spreboot_3.gz, ~300kb), Some
detailes about it:
It was obtained by running " tcpdump icmp and src or dst 10.76.18.174 or src or
dst 10.76.18.175 -i any -w iscsiadm_spreboot_3".
My local iscsi port IPs are 10.76.18.174 & 10.76.18.175.
The target iscsi IPs are 10.76.18.31 & 10.76.18.33.
Chronological order:
-Start capture
-A few seconds later, reboot SP A (one of the iscsi targets) -A long time (~10
minutes) later, the bug happens (I notice it because iscsiadm starts taking 85%
CPU) -A few seconds later, stop capture
I'm mentioning the order mainly to point out that the bug happens in the last
few seconds of the capture
And thank you for sitting on this bug with me.
:)
-----Original Message-----
From: Mike Christie [mailto:[email protected]]
Sent: Monday, April 14, 2014 8:07 PM
To: Cale, Yonatan
Cc: [email protected]; [email protected]
Subject: Re: Target reboot -> iscsiadm rescan Stuck
On 04/13/2014 07:29 AM, Cale, Yonatan wrote:
> # cat waiting-tasks.txt
> SysRq : Show Blocked State
> task PC stack pid father
> iscsid D 0000000000000000 0 2842 1 0x00000000
> ffff880137f83918 0000000000000086 ffff88010e3be2d0 ffff880137f82010
> 0000000000004000 ffff88013b2b8c40 ffff880137f83fd8 ffff880137f83fd8
> 0000000000000000 ffff88013b2b8c40 ffffffff81a0b020 ffff88013b2b8ed0
> Call Trace:
> [<ffffffff81273a97>] ? kobject_put+0x27/0x60 [<ffffffff812dc447>] ?
> put_device+0x17/0x20 [<ffffffff8103418f>] ? complete+0x4f/0x60
> [<ffffffff815ed07f>] schedule+0x3f/0x60 [<ffffffff815eda02>]
> __mutex_lock_slowpath+0x102/0x180 [<ffffffff815edf0b>]
> mutex_lock+0x2b/0x50 [<ffffffffa0011a97>]
> __iscsi_unbind_session+0x67/0x160 [scsi_transport_iscsi]
> [<ffffffffa0011ca1>] iscsi_remove_session+0x111/0x1f0
> [scsi_transport_iscsi] [<ffffffffa0011d96>]
> iscsi_destroy_session+0x16/0x60 [scsi_transport_iscsi]
> [<ffffffffa002573d>] iscsi_session_teardown+0x9d/0xd0 [libiscsi]
> [<ffffffffa0032300>] iscsi_sw_tcp_session_destroy+0x50/0x70
> [iscsi_tcp] [<ffffffffa0012c9d>] iscsi_if_rx+0x7dd/0xaa0
> [scsi_transport_iscsi] [<ffffffff814f50ee>]
> netlink_unicast+0x2ae/0x2c0 [<ffffffff814d11dc>] ?
> memcpy_fromiovec+0x7c/0xa0 [<ffffffff814f5aae>]
> netlink_sendmsg+0x33e/0x380 [<ffffffff814c55f8>]
> sock_sendmsg+0xe8/0x120 [<ffffffff811078bf>] ? do_lookup+0xcf/0x360
> [<ffffffff8111ad6f>] ? mntput+0x1f/0x40 [<ffffffff81107012>] ?
> path_put+0x22/0x30 [<ffffffff814c512b>] ?
> move_addr_to_kernel+0x6b/0x70 [<ffffffff814d13a1>] ?
> verify_iovec+0x51/0x100 [<ffffffff814c683f>]
> __sys_sendmsg+0x3df/0x3f0 [<ffffffff810f8419>] ?
> kmem_cache_free+0xe9/0xf0 [<ffffffff81100dac>] ?
> cp_new_stat+0xfc/0x120 [<ffffffff814c6a79>] sys_sendmsg+0x49/0x80
> [<ffffffff815ef86b>] system_call_fastpath+0x16/0x1b
Ok, that was more interesting than I expected. If you are not running a
iscsiadm logout command, then it looks like the target returned a error code
indicating that target is not coming back. iscsid handled this by trying to
remove the target and we are stuck in there. This and the scan command should
be fast at this time because it looks like the replacement/recovery timeout has
expired:
session1: session recovery timed out after 120 secs
session2: session recovery timed out after 120 secs
At this time, all IO is just fast failed so we are not waiting for IO and any
time we need that mutex it should be quick.
1. Can you send me a tcpdump trace? I am guessing that you do not want the
session to be deleted during this time, so we need the trace to see what error
code is leading to this.
2. I am wondering if the hang is because the device is offlined when this
happened. I did not try that case. I will setup that kernel and try here.
--
You received this message because you are subscribed to the Google Groups
"open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/open-iscsi.
For more options, visit https://groups.google.com/d/optout.