On Thu, Jul 30, 2009 at 8:20 PM, Mike Christie<[email protected]> wrote: > > Erez Zilber wrote: >> Mike, >> >> I'm seeing some strange behavior that I don't completely understand: >> during heavy IO, I disconnect the cable on the target machine and >> after a few seconds, I'm trying to logout from that target on the >> initiator side. Here's what I see: >> >> t = 0: iscsi_eh_cmd_timed_out gets called for all commands that timed >> out. It resets the timer because we have a ping task & >> time_before_eq(conn->last_recv + (conn->recv_timeout * HZ), jiffies) >> is true. >> >> t = /sys/block/sdX/device/timeout: iscsi_eh_cmd_timed_out gets called >> again. The session state is ISCSI_STATE_IN_RECOVERY. This will be >> called again and again every /sys/block/sdX/device/timeout seconds. >> The function returns and resets the timer. >> >> t = 11: I'm using iscsiadm to logout from the bad session >> >> t = 60 sec: __iscsi_block_session is called. I guess that the call >> stack is something like: mgmt_ipc_session_logout -> >> session_logout_task -> session_conn_shutdown -> iscsi_sw_tcp_conn_stop >> -> iscsi_conn_stop -> iscsi_start_session_recovery -> >> iscsi_block_session -> __iscsi_block_session. I'm not sure if this was >> initiated by the logout that iscsiadm initiated 49 seconds ago. >> >> Now, I see that the commands that were in flight time out: >> >> Jul 30 18:26:46 be8 kernel: sd 2:0:0:0: timing out command, waited 42s >> Jul 30 18:26:46 be8 kernel: sd 2:0:0:0: SCSI error: return code = 0x060e0000 >> Jul 30 18:26:46 be8 kernel: end_request: I/O error, dev sdc, sector 100663048 >> Jul 30 18:26:46 be8 kernel: sd 2:0:0:0: timing out command, waited 42s >> Jul 30 18:26:46 be8 kernel: sd 2:0:0:0: SCSI error: return code = 0x06000000 >> Jul 30 18:26:46 be8 kernel: end_request: I/O error, dev sdc, sector 100662792 >> Jul 30 18:26:46 be8 kernel: device-mapper: multipath: Failing path 8:32. >> Jul 30 18:26:46 be8 kernel: sd 2:0:0:0: timing out command, waited 42s >> Jul 30 18:26:46 be8 kernel: sd 2:0:0:0: SCSI error: return code = 0x060e0000 >> Jul 30 18:26:46 be8 kernel: end_request: I/O error, dev sdc, sector 41943296 >> >> t = 61: ctldev_handle calls iscsi_sched_conn_context with >> EV_CONN_ERROR. I'm not sure where this came from. >> >> What I don't understand is why __iscsi_block_session was called only >> after 60 seconds. Is it configurable? Anyway, here's the node info: >> > > block session is only called from iscsi_conn_stop. > > iscsi_conn_stop is called when there is a connection error like 1011. > The kernel will throw the error, then send a msg to userspace. Userspace > will then call ep_disconnect to have the kernel clean up the kernel > structs for the ep. Then when that has completed, iscsid will call > iscsi_conn_stop. > > What could be delaying the block session is that the > iscsi_start_session_recovery -> iscsi_suspend_tx will wait for the xmit > thread to flush. Before we had a timeout of 15 seconds so we could be > waiting in there that long. I just modified that to 3 seconds in that > sndtmo patch (that is also in the linux-2.6-iscsi git tree),
Sounds good. I will try this patch on my tree. > but what we really want is a way to just signal that thread to wake right > away. > Is there a way to do that (signal the thread )? > What also could be delaying it is the ping timeout detection. Do you see > a ping timeout message in the log? How long after the cable pull is it? I see a ping timeout message on t = 2. It looks like this (with extra logs that I added): Jul 30 18:25:48 be8 kernel: connection2:0: ping timeout of 5 secs expired, last rx 4295216248, last ping 4295221248, now 4295226248 Jul 30 18:25:48 be8 kernel: iscsi_check_transport_timeouts: calling iscsi_conn_failure Jul 30 18:25:48 be8 kernel: iscsi_conn_failure: calling iscsi_conn_error_event Jul 30 18:25:48 be8 kernel: connection2:0: detected conn error (1011) > Is it close to the noop values in the iscsid.conf you set? I'm using the default 5 seconds ping timeout. The cmd timeout is 7 seconds. Note that t = 0 is the time of the 1st error, not the exact time when the cable was pulled out. > When the ping timesout, we will throw the conn error 1011, which kicks off the > recovery mentioned above. > > > > Block session can also be called as a result of the iscsiadm logout. > > If we have not detected a connection problem, then session_logout_task > will call session_unbind, which will end up removing the target. If we > have to send a sync cache for the device then the unbind would wait for > the sync to complete. Can this explain why the logout is executed only around t = 60? Is it because we're syncing the cache? Where in the code does this cache sync get called? > Once the unbind is done, we will send a logout, then when that is done > iscsi_recv_logout_rsp will call > session_conn_shutdown, and that ends up calling iscsi_conn_stop. You > probably do not go completely through this path though, because at some > point we should detect the connection problem with the nop/ping. > But the connection problem was detected at t = 2... --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to [email protected] To unsubscribe from this group, send email to [email protected] For more options, visit this group at http://groups.google.com/group/open-iscsi -~----------~----~----~----~------~----~------~--~---
