Erez Zilber wrote:
> Mike,
> 
> We're testing open-iscsi + multipath. In order to make failover faster,
> we changed the following defaults:
> 
> node.session.timeo.replacement_timeout = 30
> node.conn[0].timeo.noop_out_timeout = 5

Is .timeo.noop_out_interval 10?

> 
> So, we see that ep_disconnect is called and then "session recovery timed

Before you see ep_disconnect getting called, you should see all the 
running commands failed and sent to dm.

This code in initiator.c should stop the conn, and when that happens 
libiscsi will fail the running commands to the scsi layer, which should 
fail them to dm right away because failfast is set:

         if (do_stop) {
                 /* state: STATE_CLEANUP_WAIT */
                 if (ipc->stop_conn(session->t->handle, session->id,
                                    conn->id, do_stop)) {
                         log_error("can't stop connection %d:%d (%d)",
                                   session->id, conn->id, errno);
                         delay = 5;
                         goto queue_reopen;
                 }
                 log_debug(3, "connection %d:%d is stopped for recovery",
                           session->id, conn->id);
         }
         conn->session->t->template->ep_disconnect(conn);


> out after 30 secs". After that, we still have to wait more than a minute
> until the SCSI device becomes offline. For example, if we run sg_map -i
> -x at that time, it doesn't return until the device becomes offline. We

This is expected. If a command gets sent to the path while the scsi 
layer's eh is running (or if the nop timeout does not catch the problem 
before the scsi command timeout fires), you have to wait up to 
node.session.timeo.replacement_timeout + scsi command timeout seconds 
for commands to be failed.
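
To put numbers on that, here is a rough sketch of the arithmetic. The 
function name is made up for illustration and is not part of 
open-iscsi; it only adds the two timers described above.

```c
#include <assert.h>

/*
 * Illustrative helper, not part of open-iscsi.
 *
 * Upper bound, in seconds, on how long a command can sit before it is
 * failed when it is caught by the scsi eh (or the nop does not catch
 * the problem before the scsi command timer fires):
 *
 *   node.session.timeo.replacement_timeout + scsi command timeout
 */
int worst_case_failover_secs(int replacement_timeout, int scsi_cmd_timeout)
{
        /* This sketch only handles plain non-negative second counts. */
        assert(replacement_timeout >= 0);
        assert(scsi_cmd_timeout >= 0);

        return replacement_timeout + scsi_cmd_timeout;
}
```

Calling worst_case_failover_secs() with a replacement timeout of 30 and 
whatever scsi command timeout your devices are using gives the upper 
bound for that setup.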

> think that this may be due to a timeout in scsi-ml, is it? How can we
> control it (because failover is really slow now - 1.5-2 minutes)?
> 

If your problem is that there is no IO on the path, you pull a cable, 
and then you send IO to the path, then with your current settings the 
failover is going to take node.session.timeo.replacement_timeout + scsi 
command timeout seconds. On most distros that will be about 1.5 minutes 
(30 sec replacement and scsi timer is 50 secs). So set the scsi command 
timer lower and set the replacement timer lower. If you search the list, 
people who have wanted really fast failovers and rely on dm's queueing 
use a lot lower values than I mentioned in the README.
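
As a minimal sketch of lowering the scsi command timer from a program, 
assuming the disk exposes its timer at /sys/block/<dev>/device/timeout 
(check that the file exists on your kernel; writing it needs root). 
Both helper names are made up for this example, and the value you write 
is up to you.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Illustrative helpers, not part of open-iscsi.
 *
 * Build "/sys/block/<dev>/device/timeout" into buf.
 * Returns 0 on success, -1 if buf is too small.
 */
int scsi_timeout_path(const char *dev, char *buf, size_t len)
{
        int n;

        assert(dev != NULL && buf != NULL);

        n = snprintf(buf, len, "/sys/block/%s/device/timeout", dev);
        return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write the timeout (in seconds) to the file at path.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
int write_scsi_timeout(const char *path, int secs)
{
        FILE *f;
        int ok;

        assert(path != NULL && secs > 0);

        f = fopen(path, "w");
        if (!f)
                return -1;

        ok = fprintf(f, "%d\n", secs) > 0;
        if (fclose(f) != 0)
                ok = 0;

        return ok ? 0 : -1;
}
```

You would call scsi_timeout_path() for each disk behind the session and 
pass the result to write_scsi_timeout(). The setting applies per device 
and does not persist across a reboot or a rescan, so it has to be 
reapplied (for example from a udev rule).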

If your problem is that there is IO on the path, you pull a cable, and 
then you do not see those IOs getting failed by the stop conn call 
within noop interval + noop timeout seconds, then there is a bug in the 
iscsi layer. You should turn on debugging and send the output.
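
A rough way to check which case you are in, given a measured time from 
the cable pull until the running IO was failed. The function names are 
hypothetical and only encode the rule of thumb above; they do not read 
anything from iscsid.

```c
#include <assert.h>

/*
 * Illustrative helpers, not part of open-iscsi.
 *
 * Seconds within which the nop should notice a dead path while IO is
 * running: noop_out_interval + noop_out_timeout.
 */
int nop_detect_secs(int noop_out_interval, int noop_out_timeout)
{
        assert(noop_out_interval >= 0);
        assert(noop_out_timeout >= 0);

        return noop_out_interval + noop_out_timeout;
}

/*
 * Returns 1 if running IO took longer to be failed than the nop window
 * allows, which is the case worth debugging; 0 otherwise.
 */
int io_failed_too_late(int secs_until_io_failed,
                       int noop_out_interval, int noop_out_timeout)
{
        return secs_until_io_failed >
               nop_detect_secs(noop_out_interval, noop_out_timeout);
}
```

With noop_out_timeout = 5 from your settings, and assuming 
noop_out_interval really is 10, the window is 15 seconds. If 
io_failed_too_late() returns 1 for your measurement, that is when the 
debug output is useful.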

