On 03/08/2010 05:56 AM, Or Gerlitz wrote:
Mike Christie wrote:
I have uploaded a new bug fix release to:
http://kernel.org/pub/linux/kernel/people/mnc/open-iscsi/releases/open-iscsi-2.0-871.3.tar.gz
This fixes two bugs:
2. Instead of failover taking node.session.timeo.replacement_timeout
seconds, it takes scsi_cmd->timeout * scsi_cmd->retries seconds (with
default settings this is about 60 secs * 5 = 5 minutes).
Both of these bugs are iscsi kernel module bugs. #2 is a regression
that was added in the iscsi kernel modules in open-iscsi-2.0-871.0 and
the iscsi kernel modules in upstream kernels 2.6.28 and newer.
Mike,
Looking at the 2.0.871.2 to .3 diff (below), I don't see how it relates to
using the scsi cmd timeout/retries vs. the session replacement_timeout.
> else if (conn->stop_stage != STOP_CONN_RECOVER)
> session->state = ISCSI_STATE_IN_RECOVERY;
We set session state here under the session lock.
> +
> + old_stop_stage = conn->stop_stage;
> + conn->stop_stage = flag;
> spin_unlock_bh(&session->lock);
>
We drop the lock.
Recv context then calls iscsi_conn_failure and, without the patch, sees
that stop_stage is not set, so it resets session->state to failed.
> del_timer_sync(&conn->transport_timer);
> iscsi_suspend_tx(conn);
>
> spin_lock_bh(&session->lock);
> - old_stop_stage = conn->stop_stage;
> - conn->stop_stage = flag;
> conn->c_stage = ISCSI_CONN_STOPPED;
> spin_unlock_bh(&session->lock);
A little past this point we look at session->state and see that it is
not in recovery, so we do not block the session. A little past that we
fail IO with DID_TRANSPORT_DISRUPTED. The scsi layer then requeues the
IO. The block layer sees that it is a requeue and plugs the queue. A
little bit later the block layer unplugs the queue. The scsi layer sees
that the device is running and sends the request to us. iscsi_queuecommand
sees that the session state is failed and tells the scsi layer to
requeue. The scsi and block layers do their thing again. This goes on and
on until scsi_softirq_done() sees that the cmd has run for cmd->allowed
* timeout, and then finally fails the IO.
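
As a rough illustration of the timing described above, here is a toy model in
plain C. It is not kernel code: the function name, its parameters, and the
fixed requeue interval are all assumptions made up for this sketch. It only
shows why the unblocked path ends up bounded by cmd->allowed * timeout instead
of by replacement_timeout.

```c
#include <stdbool.h>

/*
 * Toy model, not kernel code. Returns how many seconds pass before the
 * IO is failed upward. All names and parameters here are hypothetical.
 */
unsigned int secs_until_io_fails(bool session_blocked,
                                 unsigned int replacement_timeout,
                                 unsigned int cmd_timeout,
                                 unsigned int cmd_allowed,
                                 unsigned int requeue_interval)
{
    unsigned int elapsed = 0;

    /*
     * Blocked session: IO is held until the replacement timer fires
     * (assuming the connection does not come back before then).
     */
    if (session_blocked)
        return replacement_timeout;

    if (requeue_interval == 0)
        requeue_interval = 1;   /* keep the model from spinning forever */

    /*
     * Session left in the failed state and not blocked: every attempt is
     * requeued until the command has used up cmd_allowed * cmd_timeout.
     */
    while (elapsed < cmd_timeout * cmd_allowed)
        elapsed += requeue_interval;

    return elapsed;
}
```

With example values, secs_until_io_fails(false, 120, 60, 5, 1) returns 300,
while secs_until_io_fails(true, 120, 60, 5, 1) returns 120. The 120 is only an
example value for replacement_timeout.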
Moreover, this patch has two of the three or four elements present in the patch
posted in your
The code in 2.0.871.3 has the other code in the other patch already.
RHEL 5.5 did not.
"Re: Failover time of iSCSI multipath devices" March 3rd response, where
you said that
"There is a race where the session->state can get reset due to the xmit thread throwing
an error after we have set the session->state but before we have set the stop_stage".
Can you clarify this please? Maybe there is some error here. I didn't hit the problem
with the replacement_timeout... Also, when you say "failover taking", do you mean
the time it takes multipath to switch to a different device when working in a failover
configuration (e.g. not multibus)?
You are not always going to hit the race. It is difficult for me to hit.
I hit it in maybe 1 out of 50 runs.
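
To make the ordering concrete, the bad interleaving can be replayed in a small
single-threaded model. This is a simplified sketch and not the libiscsi code:
the struct, the enum values, and the function names are invented for
illustration, and the real code does this under session->lock across two
different contexts. The only thing it demonstrates is that setting stop_stage
in the same locked section as session->state closes the window.

```c
#include <stdbool.h>

/* Simplified stand-ins for the kernel state; names are hypothetical. */
enum model_state { MODEL_LOGGED_IN, MODEL_IN_RECOVERY, MODEL_FAILED };
enum model_stop  { MODEL_STOP_NONE = 0, MODEL_STOP_RECOVER };

struct model_conn {
    enum model_state session_state;
    enum model_stop stop_stage;
};

/*
 * First locked section of the recovery path. With the patch, stop_stage
 * is set here, in the same critical section as session_state.
 */
static void recovery_first_section(struct model_conn *c, bool patched)
{
    if (c->stop_stage != MODEL_STOP_RECOVER)
        c->session_state = MODEL_IN_RECOVERY;
    if (patched)
        c->stop_stage = MODEL_STOP_RECOVER;
}

/*
 * Connection failure reported from another context: the session is only
 * marked failed if no stop is in progress yet.
 */
static void conn_failure(struct model_conn *c)
{
    if (c->stop_stage == MODEL_STOP_NONE)
        c->session_state = MODEL_FAILED;
}

/* Second locked section; without the patch, stop_stage is only set here. */
static void recovery_second_section(struct model_conn *c, bool patched)
{
    if (!patched)
        c->stop_stage = MODEL_STOP_RECOVER;
}

/* Replays the bad interleaving; true means the session would get blocked. */
bool session_gets_blocked(bool patched)
{
    struct model_conn c = { MODEL_LOGGED_IN, MODEL_STOP_NONE };

    recovery_first_section(&c, patched);
    conn_failure(&c);   /* slips in while the lock is dropped */
    recovery_second_section(&c, patched);

    return c.session_state == MODEL_IN_RECOVERY;
}
```

In this model session_gets_blocked(false) is false (the state was overwritten
with failed, so the session is never blocked), and session_gets_blocked(true)
is true. On a real system the failure has to land exactly in that window,
which is why the race is hard to reproduce.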
Failover just means switching the IO to the other path. It could be in
failover or multibus mode.