Re: [ANNOUNCE] open-iscsi-2.0-871.3

Mike Christie Mon, 08 Mar 2010 12:34:27 -0800

On 03/08/2010 05:56 AM, Or Gerlitz wrote:

Mike Christie wrote:

I have uploaded a new bug fix release to:
http://kernel.org/pub/linux/kernel/people/mnc/open-iscsi/releases/open-iscsi-2.0-871.3.tar.gz
This fixes two bugs:
2. Instead of failover taking node.session.timeo.replacement_timeout
seconds, it takes scsi_cmd->timeout * scsi_cmd->retries seconds (with
default settings this is about 60 secs * 5 = 5 minutes).

Both of these are iscsi kernel module bugs. #2 is a regression
that was added in the iscsi kernel modules in open-iscsi-2.0-871.0 and
the iscsi kernel modules in upstream kernels 2.6.28 and newer.


Mike,

Looking at the 2.0.871.2 to .3 diff (below), I don't see how it relates 
to using the
scsi cmd timeout/retries vs. the session replacement_timeout.



>    else if (conn->stop_stage != STOP_CONN_RECOVER)
>            session->state = ISCSI_STATE_IN_RECOVERY;

We set session state here under the session lock.

> +
> +  old_stop_stage = conn->stop_stage;
> +  conn->stop_stage = flag;
>    spin_unlock_bh(&session->lock);
>

We drop the lock.

Recv context then calls iscsi_conn_failure and, without the patch, sees that stop_stage is not set, so it resets the session->state to failed.
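
To make that interleaving concrete, here is a rough userspace model of it. This is only a sketch and not the libiscsi code: the struct, the enum values, MODEL_STOP_RECOVER and all function names are made up for illustration, the locking is reduced to comments, and the only behaviour modelled is "recv context marks the session failed if stop_stage is still 0".

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for session->state and conn->stop_stage. */
enum session_state { STATE_LOGGED_IN, STATE_IN_RECOVERY, STATE_FAILED };

struct model {
    enum session_state state; /* models session->state */
    int stop_stage;           /* models conn->stop_stage, 0 = not stopping */
};

/* Placeholder nonzero flag for this model only. */
#define MODEL_STOP_RECOVER 1

/* First locked section of the stop path. */
static void stop_first_section(struct model *m, int flag, bool patched)
{
    assert(flag != 0);
    /* session lock taken here */
    m->state = STATE_IN_RECOVERY;
    if (patched)
        m->stop_stage = flag; /* the patch sets it before the unlock */
    /* session lock dropped here */
}

/* What recv context does if it reports a failure in the window. */
static void conn_failure(struct model *m)
{
    if (m->stop_stage == 0)
        m->state = STATE_FAILED; /* overwrites IN_RECOVERY */
}

/* Second locked section: where stop_stage was set before the patch. */
static void stop_second_section(struct model *m, int flag, bool patched)
{
    if (!patched)
        m->stop_stage = flag;
}

/* Run the racy ordering and return the state the stop path sees next. */
enum session_state run_interleaving(bool patched)
{
    struct model m = { STATE_LOGGED_IN, 0 };

    stop_first_section(&m, MODEL_STOP_RECOVER, patched);
    conn_failure(&m); /* recv context runs while the lock is dropped */
    stop_second_section(&m, MODEL_STOP_RECOVER, patched);
    return m.state;
}
```

In this model run_interleaving(false) ends with the state set to failed, while run_interleaving(true) keeps it in recovery, which is the difference the two moved lines make.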


>    del_timer_sync(&conn->transport_timer);
>    iscsi_suspend_tx(conn);
>
>    spin_lock_bh(&session->lock);
> -  old_stop_stage = conn->stop_stage;
> -  conn->stop_stage = flag;
>    conn->c_stage = ISCSI_CONN_STOPPED;
>    spin_unlock_bh(&session->lock);

A little bit past this, we look at session->state and see that it is not in recovery, so we do not block the session. A little past that, we fail IO with DID_TRANSPORT_DISRUPTED. The scsi layer then requeues the IO. The block layer sees that it is a requeue and plugs the queue. A little bit later the block layer unplugs the queue. The scsi layer sees that the device is running and sends the request to us. iscsi_queuecommand sees that the session state is failed and tells the scsi layer to requeue. The scsi and block layers do their thing again. This goes on and on until scsi_softirq_done() sees that the cmd has run for cmd->allowed * timeout, and then finally fails the IO.
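
As a back-of-the-envelope illustration of why this takes so long, the requeue loop can be modelled like this. It is a sketch under assumptions, not kernel code: secs_until_io_failed() and the requeue_interval_secs parameter are hypothetical, and the real check in the scsi layer may differ in detail from the plain allowed * timeout product used here.

```c
#include <assert.h>

/*
 * Model of the requeue loop: the command keeps getting requeued every
 * requeue_interval_secs until it has been outstanding for
 * allowed * timeout seconds, and only then is the IO failed.
 * Returns the elapsed time in seconds at which the IO is failed.
 */
unsigned int secs_until_io_failed(unsigned int timeout_secs,
                                  unsigned int allowed,
                                  unsigned int requeue_interval_secs)
{
    unsigned int wait_for = timeout_secs * allowed;
    unsigned int elapsed = 0;

    assert(requeue_interval_secs > 0); /* otherwise the loop never ends */

    /* Each pass stands for one plug/unplug/requeue round trip. */
    while (elapsed < wait_for)
        elapsed += requeue_interval_secs;

    return elapsed;
}
```

With a 60 second command timeout and 5 allowed retries, secs_until_io_failed(60, 5, 1) comes out at 300 seconds, no matter what replacement_timeout is set to, because the session is never blocked and the replacement timer never gets a say.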


Moreover, this patch has two of the three or four elements present in the patch 
posted in your

The code in 2.0.871.3 has the other code in the other patch already. RHEL 5.5 did not.

"Re: Failover time of iSCSI multipath devices" March 3rd response, where 
you said that
"There is a race where the session->state can get reset due to the xmit thread throwing 
an error after we have set the session->state but before we have set the stop_stage".

Can you clarify this please? Maybe there is some error here. I didn't hit the problem 
with the replacement_timeout... Also, when you say "failover taking", do you mean 
the time for multipath to switch to a different device when working in a failover 
configuration (i.e. not multibus)?

You are not always going to hit the race. It is difficult for me to hit. I hit it in maybe 1 out of 50 runs.

Failover just means switching the IO to the other path. It could be in failover or multibus mode.

