>> We should be flushing the workqueue before setting the suspend
>> bit. If we don't, a LU Reset might kill commands which are already
>> in the queue and waiting to be sent, causing the target to barf.
> What do you mean here? What do you mean by barf? What commands in what
> queue?
In the 'requeue' queue. There might be Data-out requests still waiting
to be sent, but they can't be, as the xmit thread is short-circuited
once we've set the SUSPEND bit.
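
To illustrate the ordering, here is a toy user-space model in C. It is only
a sketch: struct toy_conn and the toy_* functions are hypothetical names
made up for this mail and are not the libiscsi API. The only thing it
models is that the xmit worker does nothing once the suspend flag is set.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a connection; not the libiscsi structure. */
struct toy_conn {
    bool suspended;      /* models the SUSPEND bit */
    size_t requeue_len;  /* Data-out PDUs still waiting in the requeue queue */
    size_t sent;         /* Data-out PDUs that made it onto the wire */
};

/* Models the xmit worker: it is short-circuited while suspended. */
static void toy_xmit(struct toy_conn *c)
{
    if (c->suspended)
        return;
    c->sent += c->requeue_len;
    c->requeue_len = 0;
}

/* Flush first, then suspend: nothing is left behind in the queue. */
static void toy_suspend_flushed(struct toy_conn *c)
{
    toy_xmit(c);
    c->suspended = true;
}

/* Suspend first: the later xmit run is a no-op and the PDUs stay queued. */
static void toy_suspend_unflushed(struct toy_conn *c)
{
    c->suspended = true;
    toy_xmit(c);
}
```

With three queued Data-outs, toy_suspend_unflushed() leaves all three in the
queue, while toy_suspend_flushed() sends them before the flag is set.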

> Are you talking about the problem where we get a tmf response, and do
> not send data-outs? If so this will not always help. The data-out could
> be on the wire and the target could still hit the problem. The problem
> is not ours. The target should not send a tmf response that indicates it
> was successful if there is an R2T that is in the process of being fulfilled.
That is a matter of interpretation. The targets seem to implement the wording
of RFC5048, section 4.1.2:

The target iSCSI layer:
- MUST wait for responses on currently valid target-transfer tags of the
  affected tasks from the issuing initiator.

So, a 'wait' clearly indicates some sort of timer on the target side.
However, what should the response be if the initiator does not answer
all of the outstanding TTTs?
From what I've gathered, the MSA target in this case will return a
TMF response even when not all data-outs have been received, and
after a certain time will drop the connection to trigger ERL0.

Basically, all I'm trying to do is to run this test-case:

(bonnie/dt on all devices)
for d in /sys/block/sd*; do
  sg_reset -d /dev/${d##*/}
done

The result of this test run is that in about 90% of the cases sg_reset
will run through, but the other 10% cause a full ERL0.
And I'm trying to determine the reason for this peculiarity.

Weird, though, that apparently I'm hitting quite a bunch
of issues here ...

> Or...
> Are you talking about if we have task 0 and task 1 in progress, then
> send a lu reset that should affect those two tasks, but then we get a
> task2 in the cmdqueue list, and then we get a tmf response, so we end up
> cleaning up task0, task1 and task2? If so, how does the target barf here
> (is it a connection drop or some error because the task does not get sent)?
Hmm. Checking ...
No, I won't be sending any other SCSI command PDU to the affected LUN; that's
being blocked by the check_tmf_restrictions() function.

> If this is your problem, then your patch would work under normal
> conditions, but if you have sndtmo=1 then you could still hit this
> problem a little more easily than normal. If the sendpage code returned
> EAGAIN (which gets converted to ENOBUFS in your patch) due to the sndtmo
> then we will return from
> iscsi_data_xmit without completely sending the task or any task behind
> it (if there was also a task3 queued).
> We could also hit the problem, where if task0 was affected by a TMF then
> iscsi_data_xmit returns EACCES. Then task2 is not going to get sent,
> because it is stuck behind task0 on the cmdlist. I think in the patches
> [PATCH 1/5] libiscsi: Allow multiple LUN Reset TMF
> [PATCH 2/5] Check for TMF state before sending PDU
> you need to also modify conn->tmf_state and clear the conn->tmf header
> before suspending the xmit thread.
> I was thinking about this the other day and I think we can just add a
> check in fail_scsi_task to check for tasks with a cmdsn less than the lu
> reset cmd's cmdsn, right? Maybe we want to fail the affected tasks with
> DID_ERROR, but then for the non-affected tasks we could fail with
> DID_IMM_RETRY. So then if this was a reset from something like sg_reset
> those tasks that were not affected would get another full timeout period
> to run in case the lu reset took a long time.
I was actually thinking along the same lines, and even added some CmdSN printks.
However, the CmdSN of a running task is not getting updated, so currently
these are quite pointless. But I can see what I'm getting.
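
For reference, the CmdSN check suggested above could look roughly like the
following user-space C. This is a sketch under assumptions, not the actual
fail_scsi_task code: the enum values merely stand in for DID_ERROR and
DID_IMM_RETRY, and cmdsn_before() and classify_task() are hypothetical
helper names. The comparison is done in 32-bit serial-number arithmetic so
that it keeps working when CmdSN wraps around.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the host bytes DID_ERROR and DID_IMM_RETRY. */
enum sketch_result {
    SKETCH_DID_ERROR,      /* task was affected by the LU reset */
    SKETCH_DID_IMM_RETRY   /* task was not affected, retry it */
};

/*
 * Serial-number "less than" for 32-bit CmdSN values.
 * The cast of the difference to a signed type relies on the usual
 * two's complement behaviour of common compilers.
 */
static int cmdsn_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Tasks issued before the LU reset are failed, later ones are retried. */
static enum sketch_result classify_task(uint32_t task_cmdsn,
                                        uint32_t reset_cmdsn)
{
    if (cmdsn_before(task_cmdsn, reset_cmdsn))
        return SKETCH_DID_ERROR;
    return SKETCH_DID_IMM_RETRY;
}
```

Of course this only helps once the CmdSN stored in the running task is
actually kept up to date.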


