On Thu, Jul 30, 2009 at 8:20 PM, Mike Christie<micha...@cs.wisc.edu> wrote:
>
> Erez Zilber wrote:
>> Mike,
>>
>> I'm seeing some strange behavior that I don't completely understand:
>> during heavy IO, I disconnect the cable on the target machine and
>> after a few seconds, I'm trying to logout from that target on the
>> initiator side. Here's what I see:
>>
>> t = 0: iscsi_eh_cmd_timed_out gets called for all commands that timed
>> out. It resets the timer because we have a ping task &
>> time_before_eq(conn->last_recv + (conn->recv_timeout * HZ), jiffies)
>> is true.
>>
>> t = /sys/block/sdX/device/timeout: iscsi_eh_cmd_timed_out gets called
>> again. The session state is ISCSI_STATE_IN_RECOVERY. This will be
>> called again and again every /sys/block/sdX/device/timeout seconds.
>> The function returns and resets the timer.
>>
>> t = 11: I'm using iscsiadm to logout from the bad session
>>
>> t = 60 sec: __iscsi_block_session is called. I guess that the call
>> stack is something like: mgmt_ipc_session_logout ->
>> session_logout_task -> session_conn_shutdown -> iscsi_sw_tcp_conn_stop
>> -> iscsi_conn_stop -> iscsi_start_session_recovery ->
>> iscsi_block_session -> __iscsi_block_session. I'm not sure if this was
>> initiated by the logout that iscsiadm initiated 49 seconds ago.
>>
>> Now, I see that the commands that were in flight time out:
>>
>> Jul 30 18:26:46 be8 kernel: sd 2:0:0:0: timing out command, waited 42s
>> Jul 30 18:26:46 be8 kernel: sd 2:0:0:0: SCSI error: return code = 0x060e0000
>> Jul 30 18:26:46 be8 kernel: end_request: I/O error, dev sdc, sector 100663048
>> Jul 30 18:26:46 be8 kernel: sd 2:0:0:0: timing out command, waited 42s
>> Jul 30 18:26:46 be8 kernel: sd 2:0:0:0: SCSI error: return code = 0x06000000
>> Jul 30 18:26:46 be8 kernel: end_request: I/O error, dev sdc, sector 100662792
>> Jul 30 18:26:46 be8 kernel: device-mapper: multipath: Failing path 8:32.
>> Jul 30 18:26:46 be8 kernel: sd 2:0:0:0: timing out command, waited 42s
>> Jul 30 18:26:46 be8 kernel: sd 2:0:0:0: SCSI error: return code = 0x060e0000
>> Jul 30 18:26:46 be8 kernel: end_request: I/O error, dev sdc, sector 41943296
>>
>> t = 61: ctldev_handle calls iscsi_sched_conn_context with
>> EV_CONN_ERROR. I'm not sure where this came from.
>>
>> What I don't understand is why __iscsi_block_session was called only
>> after 60 seconds. Is it configurable? Anyway, here's the node info:
>>
>
> block session is only called from iscsi_conn_stop.
>
> iscsi_conn_stop is called when there is a connection error like 1011.
> The kernel will throw the error, then send a msg to userspace. Userspace
> will then call ep_disconnect to have the kernel clean up the kernel
> structs for the ep.  Then when that has completed, iscsid will call
> iscsi_conn_stop.
>
> What could be delaying the block session is that the
> iscsi_start_session_recovery -> iscsi_suspend_tx will wait for the xmit
> thread to flush. Before we had a timeout of 15 seconds so we could be
> waiting in there that long. I just modified that to 3 seconds in that
> sndtmo patch (that is also in the linux-2.6-iscsi git tree),

Sounds good. I will try this patch on my tree.


> but what we really want is a way to just signal that thread to wake right 
> away.
>

Is there a way to do that (signal the thread )?

> What also could be delaying it is the ping timeout detection. Do you see
> a ping timeout message in the log? How long after the cable pull is it?

I see a ping timeout message on t = 2. It looks like this (with extra
logs that I added):

Jul 30 18:25:48 be8 kernel:  connection2:0: ping timeout of 5 secs
expired, last rx 4295216248, last ping 4295221248, now 4295226248
Jul 30 18:25:48 be8 kernel: iscsi_check_transport_timeouts: calling
iscsi_conn_failure
Jul 30 18:25:48 be8 kernel: iscsi_conn_failure: calling iscsi_conn_error_event
Jul 30 18:25:48 be8 kernel:  connection2:0: detected conn error (1011)

>  Is it close to the noop values in the iscsid.conf you set?

I'm using the default 5 seconds ping timeout. The cmd timeout is 7
seconds. Note that t = 0 is the time of the 1st error, not the exact
time when the cable was pulled out.

> When the ping timesout, we will throw the conn error 1011, which kicks off the
> recovery mentioned above.
>
>
>
> Block session can also be called as a result of the iscsiadm logout.
>
> If we have not detected a connection problem, then session_logout_task
> will call session_unbind, which will end up removing the target. If we
> have to send a sync cache for the device then the unbind would wait for
> the sync to complete.

Can this explain why the logout is executed only around t = 60? Is it
because we're syncing the cache? Where in the code does this cache
sync get called?

> Once the unbind is done, we will send a logout, then when that is done 
> iscsi_recv_logout_rsp will call
> session_conn_shutdown, and that ends up calling iscsi_conn_stop. You
> probably do not go completely through this path though, because at some
> point we should detect the connection problem with the nop/ping.
>

But the connection problem was detected at t = 2...

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~----------~----~----~----~------~----~------~--~---

Reply via email to