Re: Debugging scsi abort handling ?

Hannes Reinecke Thu, 28 Aug 2014 05:11:54 -0700

On 08/26/2014 09:19 PM, Hans de Goede wrote:

Hi,


On 08/26/2014 08:34 PM, James Bottomley wrote:

On Tue, 2014-08-26 at 10:13 +0200, Hans de Goede wrote:

Hi,

On 08/25/2014 05:41 PM, James Bottomley wrote:

On Mon, 2014-08-25 at 13:39 +0200, Paolo Bonzini wrote:

Il 25/08/2014 13:26, Hans de Goede ha scritto:

Thanks Bart and Paolo, your insights into this are greatly
appreciated.

So with uas there are separate usb transaction for cmd, data
in, data out
and sense for each tag. At the time of abort, usually one of
data in / data
out and a sense usb transaction will be outstanding.

There already is logic in the driver to kill the data in / out
transactions
if a sense gets returned (usually with an error) before they
are done.

So if I'm reading this correctly, then on a successful abort,
the sense
transaction (if not already completed by the target) should be
cancelled as
it will never complete, correct ?


Yes.  More precisely, once the response IU comes back for the
abort,
there should be no more IUs for that task.  Figure 13
("Multiple Command
Example") in the UAS spec actually shows that.

At least, that's what it should be like in theory.  I suspect
firmware
bugs will abound in this area, but at least you shouldn't be
actively
expecting an IU for an aborted task.


Just a note on this.  The abort area in all devices is where we
have
scope for spec compliance issues.  Also in the old days abort
itself
could trigger a firmware crash on some devices (including
arrays).  The
problem is actually that the system is usually groaning because
it's out
of memory within the controller.  That actually means that
sending in
another task (the abort) causes greater pressure.  For this
reason, some
device drivers choose to skip the abort step and head straight
to reset.
For reset, you guarantee that all outstanding tasks for the unit
are
killed.


Hmm, I like this idea, given the finickiness of the abort path in
the uas
driver, and that:

1) We really have no proper way to test this
2) We already have some known issues there (we don't kill sense
urbs atm,
    and I've a note somewhere about a double free on some corner
case
    where an urb submit fails)
3)
3) In all the cases where I've managed to trip op an uas device
the only
    thing which actually worked to recover things was doing a usb
reset
4) Aborts should not happen in the first place, so using a big
hammer
    when they do is not really a big problem, instead we should
focus
    on figuring out why they happen and fix the cause

I think that just dropping abort handling altogether is a good
idea actually,
so if there are no objections I'm just going to do that.

I can simply not set eh_abort_handler (aka set it to NULL), right ?


Just so you know, if you do this, error handling becomes much more
painful.  The abort handler can now handle aborting and retrying
without
pausing the host.  The reset definitely stops the host and causes
a big
and noticeable burp in processing.  However, there are hosts which
have
to do it.


I understand, but shouldn't aborts be something which rarely happens.
I guess that with a faulty drive they happen more often, but at this
point in time the uas driver's error handling problems are mostly with
dealing with timeouts during probing, which are likely caused by
the device not grokking some command we've send, and responding to
this in a bad manner (hanging mostly).

So I'm tempted to just remove the abort handling, and first focus on
getting uas stable under all conditions. Once it is stable we can
look into making it deal more graciously with errors.

Sounds like a reasonable plan.

[ .. ]


Wait, could this also be your abort problem?  In the running abort
handler, we could get a scenario where we abort a bunch of
commands at the same time.


I don't think so, the uas code does not return from eh_abort_handler
until the abort has either succeeded or timedout (for which a 3 second
timeout is used). AFAIK the scsi core will always issue multiple aborts
sequentially, right. And if the abort succeeds then the tag is free
again for the next abort. Only if an abort fails (times out, which is
what I've seen in all the error conditions which I've encountered so
far),
then will subsequent aborts, and the logical unit reset, fail due to
there not being a free tag to use for them.

If the logic for command/task abort and logical unit reset issimilar then there is no point in implementing both.So by what I've seen I would first implement target reset (and hostreset :-), and worry about finer grained error control later on.


Cheers,

Hannes
--
Dr. Hannes Reinecke                   zSeries & Storage
h...@suse.de                          +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

Reply via email to