Hi Mike,

Regarding the analysis of waiting-tasks.txt, I'd like to add some information 
about the issue:

After running the "rescan" that gets stuck, any other iscsiadm command we try 
to run gets stuck too. My guess is that the first command takes some mutex 
that all subsequent commands then wait on. (I haven't verified this; it's 
mainly a guess.)
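If it helps confirm the guess, here is a small shell sketch (not verified on this setup; it assumes root and a kernel that exposes /proc/<pid>/stack) for checking whether the stuck iscsiadm processes are parked in the mutex slow path:

```shell
# Sketch: report whether a saved kernel stack shows a mutex wait.
# Matches frames like "mutex_lock+0x2b/0x50" or
# "__mutex_lock_slowpath+0x102/0x180".
stack_waits_on_mutex() {
    grep -q 'mutex_lock' "$1"
}

# Hypothetical run against the live system (needs root):
# for pid in $(pgrep -x iscsiadm); do
#     stack_waits_on_mutex "/proc/$pid/stack" && echo "pid $pid waits on a mutex"
# done
```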

I suspect the command you analyzed in the previous mail is not the original 
"rescan" but something that was run later; in other words, I don't think it 
is the root cause.

Here are two example stack traces (from two different runs) that I believe 
belong to the stuck iscsiadm process:

{
iscsiadm        R  running task        0  4597      1 0x00000004
 ffff88009bab38b8 ffffffff815ec84c ffff88009bab37d8 ffff88009bab2010
 0000000000004000 ffff88009ba1d5c0 ffff88009bab3fd8 ffff88009bab3fd8
 0000000000000000 ffff88009ba1d5c0 ffff88013b088c40 ffff88009ba1d858
Call Trace:
 [<ffffffff815ec84c>] ? __schedule+0x3fc/0x8d0
 [<ffffffff8103c765>] ? wake_up_process+0x15/0x20
 [<ffffffff810488b5>] ? raise_softirq_irqoff+0x95/0xb0
 [<ffffffff81279536>] ? number+0x316/0x340
 [<ffffffff8126df97>] ? cfq_set_request+0x257/0x470
 [<ffffffffa0028002>] ? iscsi_queuecommand+0x102/0x4f0 [libiscsi]
 [<ffffffff81051c4b>] ? lock_timer_base+0x3b/0x70
 [<ffffffff810e1d3d>] free_percpu+0x9d/0x130
 [<ffffffff810e22b9>] ? pcpu_alloc+0x1e9/0x900
 [<ffffffff81273bda>] ? kobject_release+0x4a/0xa0
 [<ffffffff8127400d>] ? kobject_set_name_vargs+0x6d/0x80
 [<ffffffff8112e0b6>] ? bio_phys_segments+0x26/0x30
 [<ffffffff8125ca20>] ? blk_rq_bio_prep+0x30/0xc0
 [<ffffffff8125f795>] ? blk_put_queue+0x15/0x20
 [<ffffffff81310a8f>] ? scsi_device_dev_release_usercontext+0x17f/0x1d0
 [<ffffffff8125f61c>] ? blk_alloc_queue_node+0x6c/0x1b0
 [<ffffffff8125fcb4>] ? blk_init_queue_node+0x34/0x80
 [<ffffffff812739b7>] ? kobject_init+0x37/0xa0
 [<ffffffff8130cd26>] ? scsi_execute_req+0xa6/0x140
 [<ffffffff8130e0f9>] ? scsi_probe_and_add_lun+0x289/0xce0
 [<ffffffff810cbb5d>] ? free_one_page+0x20d/0x410
 [<ffffffff815f01ce>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff81273bed>] ? kobject_release+0x5d/0xa0
 [<ffffffff8130f168>] ? __scsi_scan_target+0x3d8/0x620
 [<ffffffff81279d5b>] ? vsscanf+0x2ab/0x790
 [<ffffffff8130fa28>] ? scsi_scan_target+0xc8/0xd0
 [<ffffffffa00111b6>] ? iscsi_user_scan_session+0x156/0x1a0 [scsi_transport_iscsi]
 [<ffffffffa0011060>] ? iscsi_host_for_each_session+0x20/0x20 [scsi_transport_iscsi]
 [<ffffffff812dcc73>] ? device_for_each_child+0x53/0x80
 [<ffffffffa001103d>] ? iscsi_user_scan+0x2d/0x30 [scsi_transport_iscsi]
 [<ffffffff813105cc>] ? store_scan+0xec/0x120
 [<ffffffff8110b568>] ? do_filp_open+0x48/0xa0
 [<ffffffff812dbdc0>] ? dev_attr_store+0x20/0x30
 [<ffffffff81162f31>] ? sysfs_write_file+0xd1/0x150
 [<ffffffff810fcf3b>] ? vfs_write+0xcb/0x170
 [<ffffffff810fd0e5>] ? sys_write+0x55/0x90
 [<ffffffff815ef86b>] ? system_call_fastpath+0x16/0x1b
}

{
iscsiadm        R  running task        0  2845      1 0x00000004
ffff8800682078b8 ffffffff815ec84c ffff8800682077d8 ffff880068206010
 0000000000004000 ffff8800680e8c40 ffff880068207fd8 ffff880068207fd8
 0000000000000000 ffff8800680e8c40 ffff88013b088c40 ffff8800680e8ed8
Call Trace:
 [<ffffffff815ec84c>] ? __schedule+0x3fc/0x8d0
 [<ffffffff8103c765>] ? wake_up_process+0x15/0x20
 [<ffffffff810488b5>] ? raise_softirq_irqoff+0x95/0xb0
 [<ffffffff81279536>] ? number+0x316/0x340
 [<ffffffff8126df97>] ? cfq_set_request+0x257/0x470
 [<ffffffff8130c02d>] ? scsi_setup_blk_pc_cmnd+0x12d/0x160
 [<ffffffff81304eb1>] ? scsi_dispatch_cmd+0xf1/0x210
 [<ffffffff815ecdc5>] ? _cond_resched+0x35/0x50
 [<ffffffff815ece1b>] ? wait_for_common+0x3b/0x180
 [<ffffffff8128ae8a>] ? percpu_counter_destroy+0x1a/0x30
 [<ffffffff812760c9>] ? prop_local_destroy_percpu+0x9/0x10
 [<ffffffff810e22b9>] ? pcpu_alloc+0x1e9/0x900
 [<ffffffff81273bda>] ? kobject_release+0x4a/0xa0
 [<ffffffff81273fcd>] ? kobject_set_name_vargs+0x2d/0x80
 [<ffffffff812dca69>] ? dev_set_name+0x69/0x70
 [<ffffffff8125ca20>] ? blk_rq_bio_prep+0x30/0xc0
 [<ffffffff8125f795>] ? blk_put_queue+0x15/0x20
 [<ffffffff81310a8f>] ? scsi_device_dev_release_usercontext+0x17f/0x1d0
 [<ffffffff8125f61c>] ? blk_alloc_queue_node+0x6c/0x1b0
 [<ffffffff8125fcb4>] ? blk_init_queue_node+0x34/0x80
 [<ffffffff8125fd13>] ? blk_init_queue+0x13/0x20
 [<ffffffff812e3e55>] ? transport_setup_device+0x15/0x20
 [<ffffffff8130cd26>] ? scsi_execute_req+0xa6/0x140
 [<ffffffff8130e0f9>] ? scsi_probe_and_add_lun+0x289/0xce0
 [<ffffffff810cbb5d>] ? free_one_page+0x20d/0x410
 [<ffffffff815f01ce>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff81273bed>] ? kobject_release+0x5d/0xa0
 [<ffffffff8130f168>] ? __scsi_scan_target+0x3d8/0x620
 [<ffffffff81279d5b>] ? vsscanf+0x2ab/0x790
 [<ffffffff8130fa28>] ? scsi_scan_target+0xc8/0xd0
 [<ffffffffa00111b6>] ? iscsi_user_scan_session+0x156/0x1a0 [scsi_transport_iscsi]
 [<ffffffffa0011060>] ? iscsi_host_for_each_session+0x20/0x20 [scsi_transport_iscsi]
 [<ffffffff812dcc73>] ? device_for_each_child+0x53/0x80
 [<ffffffffa001103d>] ? iscsi_user_scan+0x2d/0x30 [scsi_transport_iscsi]
 [<ffffffff813105cc>] ? store_scan+0xec/0x120
 [<ffffffff8110b568>] ? do_filp_open+0x48/0xa0
 [<ffffffff812dbdc0>] ? dev_attr_store+0x20/0x30
 [<ffffffff81162f31>] ? sysfs_write_file+0xd1/0x150
 [<ffffffff810fcf3b>] ? vfs_write+0xcb/0x170
 [<ffffffff810fd0e5>] ? sys_write+0x55/0x90
 [<ffffffff815ef86b>] ? system_call_fastpath+0x16/0x1b
}
The two traces have a lot in common but also some differences; I'm providing 
both in the hope that the comparison is more informative for you.

They were obtained by running:
echo t > /proc/sysrq-trigger
dmesg > dmesg.txt

I know this is the stuck process because I ran "killall iscsid" first, and 
because dmesg.txt contains only one "iscsiadm" process.
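In case it's useful, here is the awk sketch I used to pull a single task's trace back out of the full sysrq-t dump (it keys off the task-header lines, where the second field is the one-letter scheduler state):

```shell
# Print only the trace block belonging to one task name from a sysrq-t dump.
# Task header lines look like "iscsiadm  R  running task  0 4597 1 0x..." --
# the second field is the single-letter scheduler state (R, S, D, ...).
extract_trace() {
    awk -v task="$1" '
        $2 ~ /^[RSDTZX]$/ { grab = ($1 == task) }  # task header: toggle capture
        grab                                        # print while inside block
    ' "$2"
}
```

For example: extract_trace iscsiadm dmesg.txt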

Does this make sense?

-----Original Message-----
From: Cale, Yonatan 
Sent: Wednesday, April 16, 2014 7:22 PM
To: 'Mike Christie'
Cc: [email protected]; [email protected]
Subject: RE: Target reboot -> iscsiadm rescan Stuck

Hi,

Attached is a compressed tcpdump capture (iscsiadm_spreboot_3.gz, ~300 KB). 
Some details about it:
It was obtained by running "tcpdump icmp and src or dst 10.76.18.174 or src or 
dst 10.76.18.175 -i any -w iscsiadm_spreboot_3".
My local iscsi port IPs are 10.76.18.174 and 10.76.18.175.
The target iscsi IPs are 10.76.18.31 and 10.76.18.33.
Chronological order:
- Start capture
- A few seconds later, reboot SP A (one of the iscsi targets)
- A long time (~10 minutes) later, the bug happens (I notice it because 
  iscsiadm starts taking 85% CPU)
- A few seconds later, stop capture

I'm mentioning the order mainly to point out that the bug happens in the last 
few seconds of the capture.

And thank you for sticking with this bug with me.
:)

-----Original Message-----
From: Mike Christie [mailto:[email protected]]
Sent: Monday, April 14, 2014 8:07 PM
To: Cale, Yonatan
Cc: [email protected]; [email protected]
Subject: Re: Target reboot -> iscsiadm rescan Stuck

On 04/13/2014 07:29 AM, Cale, Yonatan wrote:

> # cat waiting-tasks.txt
> SysRq : Show Blocked State
>   task                        PC stack   pid father
> iscsid          D 0000000000000000     0  2842      1 0x00000000
>  ffff880137f83918 0000000000000086 ffff88010e3be2d0 ffff880137f82010
>  0000000000004000 ffff88013b2b8c40 ffff880137f83fd8 ffff880137f83fd8
>  0000000000000000 ffff88013b2b8c40 ffffffff81a0b020 ffff88013b2b8ed0 
> Call Trace:
>  [<ffffffff81273a97>] ? kobject_put+0x27/0x60
>  [<ffffffff812dc447>] ? put_device+0x17/0x20
>  [<ffffffff8103418f>] ? complete+0x4f/0x60
>  [<ffffffff815ed07f>] schedule+0x3f/0x60
>  [<ffffffff815eda02>] __mutex_lock_slowpath+0x102/0x180
>  [<ffffffff815edf0b>] mutex_lock+0x2b/0x50
>  [<ffffffffa0011a97>] __iscsi_unbind_session+0x67/0x160 [scsi_transport_iscsi]
>  [<ffffffffa0011ca1>] iscsi_remove_session+0x111/0x1f0 [scsi_transport_iscsi]
>  [<ffffffffa0011d96>] iscsi_destroy_session+0x16/0x60 [scsi_transport_iscsi]
>  [<ffffffffa002573d>] iscsi_session_teardown+0x9d/0xd0 [libiscsi]
>  [<ffffffffa0032300>] iscsi_sw_tcp_session_destroy+0x50/0x70 [iscsi_tcp]
>  [<ffffffffa0012c9d>] iscsi_if_rx+0x7dd/0xaa0 [scsi_transport_iscsi]
>  [<ffffffff814f50ee>] netlink_unicast+0x2ae/0x2c0
>  [<ffffffff814d11dc>] ? memcpy_fromiovec+0x7c/0xa0
>  [<ffffffff814f5aae>] netlink_sendmsg+0x33e/0x380
>  [<ffffffff814c55f8>] sock_sendmsg+0xe8/0x120
>  [<ffffffff811078bf>] ? do_lookup+0xcf/0x360
>  [<ffffffff8111ad6f>] ? mntput+0x1f/0x40
>  [<ffffffff81107012>] ? path_put+0x22/0x30
>  [<ffffffff814c512b>] ? move_addr_to_kernel+0x6b/0x70
>  [<ffffffff814d13a1>] ? verify_iovec+0x51/0x100
>  [<ffffffff814c683f>] __sys_sendmsg+0x3df/0x3f0
>  [<ffffffff810f8419>] ? kmem_cache_free+0xe9/0xf0
>  [<ffffffff81100dac>] ? cp_new_stat+0xfc/0x120
>  [<ffffffff814c6a79>] sys_sendmsg+0x49/0x80
>  [<ffffffff815ef86b>] system_call_fastpath+0x16/0x1b

Ok, that was more interesting than I expected. If you are not running an 
iscsiadm logout command, then it looks like the target returned an error code 
indicating that the target is not coming back. iscsid handled this by trying to 
remove the target, and we are stuck in there. This and the scan command should 
be fast at this time, because it looks like the replacement/recovery timeout 
has expired:

 session1: session recovery timed out after 120 secs
 session2: session recovery timed out after 120 secs

At this time, all IO is just fast-failed, so we are not waiting on IO, and any 
time we need that mutex it should be quick.

1. Can you send me a tcpdump trace? I am guessing that you do not want the 
session to be deleted during this time, so we need the trace to see what error 
code is leading to this.

2. I am wondering if the hang is because the device was offlined when this 
happened. I did not try that case. I will set up that kernel and try it here.

-- 
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/open-iscsi.
For more options, visit https://groups.google.com/d/optout.
