On 03/08/2010 05:56 AM, Or Gerlitz wrote:
Mike Christie wrote:
I have uploaded a new bug fix release to:
http://kernel.org/pub/linux/kernel/people/mnc/open-iscsi/releases/open-iscsi-2.0-871.3.tar.gz
This fixes two bugs:
2. Instead of failover taking node.session.timeo.replacement_timeout
seconds, it takes scsi_cmd->timeout * scsi_cmd->retries seconds (with
default settings this is about 60 secs * 5 = 5 minutes).

Both of these bugs are iscsi kernel module bugs. #2 is a regression
that was added in the iscsi kernel modules in open-iscsi-2.0-871.0 and
the iscsi kernel modules in upstream kernels 2.6.28 and newer.

Mike,

Looking at the 2.0.871.2 to .3 diff (below), I don't see how it relates
to using the
scsi cmd timeout/retries vs. the session replacement_timeout.


>    else if (conn->stop_stage != STOP_CONN_RECOVER)
>            session->state = ISCSI_STATE_IN_RECOVERY;

We set session state here under the session lock.

> +
> +  old_stop_stage = conn->stop_stage;
> +  conn->stop_stage = flag;
>    spin_unlock_bh(&session->lock);
>

We drop the lock.

The recv context then calls iscsi_conn_failure and, without the patch, sees that stop_stage is not set, so it resets session->state to failed.
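To make the window concrete, here is a minimal single-threaded sketch of the ordering described above. It is not the libiscsi code: the sim_* names, the reduced state enum and the "racing_failure" switch are all made up for illustration, and the spinlock is only marked by comments. It just replays the failure path at the point where the lock is dropped.

```c
#include <stdbool.h>

/* Hypothetical, reduced model of the session state (not the kernel enums). */
enum sim_state { SIM_LOGGED_IN, SIM_IN_RECOVERY, SIM_FAILED };

struct sim_session {
	enum sim_state state;
	int stop_stage;		/* 0 means no stop is in progress */
};

/* Models the recv-context failure path: it only marks the session
 * failed when it believes no stop is in progress. */
static void sim_conn_failure(struct sim_session *s)
{
	if (s->state == SIM_FAILED)
		return;
	if (s->stop_stage == 0)
		s->state = SIM_FAILED;
}

/* Models the stop path. With patched == false, stop_stage is set only
 * in the second locked section; with patched == true, it is set in the
 * first one, together with the state. racing_failure replays the
 * failure path in the window where the lock is not held. */
static enum sim_state sim_stop_conn(struct sim_session *s, int flag,
				    bool patched, bool racing_failure)
{
	/* first locked section */
	s->state = SIM_IN_RECOVERY;
	if (patched)
		s->stop_stage = flag;
	/* lock dropped here */

	if (racing_failure)
		sim_conn_failure(s);

	/* second locked section */
	if (!patched)
		s->stop_stage = flag;
	return s->state;
}
```

With the old ordering and a racing failure, the model ends in SIM_FAILED. With the patched ordering it stays in SIM_IN_RECOVERY, because the failure path already sees a non-zero stop_stage.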

>    del_timer_sync(&conn->transport_timer);
>    iscsi_suspend_tx(conn);
>
>    spin_lock_bh(&session->lock);
> -  old_stop_stage = conn->stop_stage;
> -  conn->stop_stage = flag;
>    conn->c_stage = ISCSI_CONN_STOPPED;
>    spin_unlock_bh(&session->lock);

A little past this, we look at session->state, see that it is not in recovery, and so do not block the session. A little past that, we fail IO with DID_TRANSPORT_DISRUPTED. The scsi layer then requeues the IO. The block layer sees that it is a requeue and plugs the queue. A little later the block layer unplugs the queue. The scsi layer sees that the device is running and sends the request to us. iscsi_queuecommand sees that the session state is failed and tells the scsi layer to requeue. The scsi and block layers do their thing again. This goes on and on until scsi_softirq_done() sees that the cmd has run for cmd->allowed * timeout, and then finally fails the IO.
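The timing consequence of that loop can be sketched roughly as follows. This is a toy model, not the scsi midlayer: sim_cmd, the one-second granularity and the requeue interval are assumptions, and the budget check simply follows the "cmd->allowed * timeout" description above rather than the exact kernel expression.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the fields of a scsi command that matter here. */
struct sim_cmd {
	unsigned int timeout_secs;	/* per-command timeout */
	unsigned int allowed;		/* allowed retries */
	unsigned int age_secs;		/* time since the command was allocated */
};

/* Simplified form of the check described for scsi_softirq_done(). */
static bool sim_budget_exhausted(const struct sim_cmd *cmd)
{
	return cmd->age_secs >= cmd->timeout_secs * cmd->allowed;
}

/* Returns how many seconds pass before the IO is failed upwards.
 * If the session was blocked correctly, the replacement timeout
 * bounds the wait. Otherwise the command bounces between requeue
 * and queuecommand until its own time budget runs out. */
static unsigned int sim_secs_until_io_fails(struct sim_cmd *cmd,
					    bool session_blocked,
					    unsigned int replacement_timeout_secs,
					    unsigned int requeue_interval_secs)
{
	if (session_blocked)
		return replacement_timeout_secs;

	if (requeue_interval_secs == 0)
		requeue_interval_secs = 1;	/* avoid a loop that never advances */

	while (!sim_budget_exhausted(cmd))
		cmd->age_secs += requeue_interval_secs;	/* one requeue round trip */

	return cmd->age_secs;
}
```

With a 60 second timeout and 5 allowed retries, the unblocked case takes 300 seconds in this model, whatever value is passed as the replacement timeout. The blocked case returns the replacement timeout directly.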






Moreover, this patch has two out of the three or four elements present in the patch
posted in your

The code in 2.0.871.3 has the other code in the other patch already. RHEL 5.5 did not.


"Re: Failover time of iSCSI multipath devices" March 3rd response, where
you said that
"There is a race where the session->state can get reset due to the xmit thread throwing 
an error after we have set the session->state but before we have set the stop_stage".

Can you clarify this please? Maybe there is some error here. I didn't hit the problem
with the replacement_timeout... Also, when you say "failover taking", do you mean
the time for multipath to switch to a different device when working in a failover
configuration (e.g. not multibus)?


You are not always going to hit the race. It is difficult for me to hit. I hit it in maybe 1 out of 50 runs.

Failover just means switching the IO to the other path. It could be in failover or multibus mode.
