Re: Debugging scsi abort handling ?

Hans de Goede Thu, 28 Aug 2014 05:24:36 -0700

Hi,

On 08/28/2014 02:10 PM, Hannes Reinecke wrote:
> On 08/26/2014 09:19 PM, Hans de Goede wrote:
>> Hi,
>>
>> On 08/26/2014 08:34 PM, James Bottomley wrote:
>>> On Tue, 2014-08-26 at 10:13 +0200, Hans de Goede wrote:
>>>> Hi,
>>>>
>>>> On 08/25/2014 05:41 PM, James Bottomley wrote:
>>>>> On Mon, 2014-08-25 at 13:39 +0200, Paolo Bonzini wrote:
>>>>>> Il 25/08/2014 13:26, Hans de Goede ha scritto:
>>>>>>> Thanks Bart and Paolo, your insights into this are greatly
>>>>>>> appreciated.
>>>>>>>
>>>>>>> So with uas there are separate usb transaction for cmd, data
>>>>>>> in, data out
>>>>>>> and sense for each tag. At the time of abort, usually one of
>>>>>>> data in / data
>>>>>>> out and a sense usb transaction will be outstanding.
>>>>>>>
>>>>>>> There already is logic in the driver to kill the data in / out
>>>>>>> transactions
>>>>>>> if a sense gets returned (usually with an error) before they
>>>>>>> are done.
>>>>>>>
>>>>>>> So if I'm reading this correctly, then on a successful abort,
>>>>>>> the sense
>>>>>>> transaction (if not already completed by the target) should be
>>>>>>> cancelled as
>>>>>>> it will never complete, correct ?
>>>>>>
>>>>>> Yes.  More precisely, once the response IU comes back for the
>>>>>> abort,
>>>>>> there should be no more IUs for that task.  Figure 13
>>>>>> ("Multiple Command
>>>>>> Example") in the UAS spec actually shows that.
>>>>>>
>>>>>> At least, that's what it should be like in theory.  I suspect
>>>>>> firmware
>>>>>> bugs will abound in this area, but at least you shouldn't be
>>>>>> actively
>>>>>> expecting an IU for an aborted task.
>>>>>
>>>>> Just a note on this.  The abort area in all devices is where we
>>>>> have
>>>>> scope for spec compliance issues.  Also in the old days abort
>>>>> itself
>>>>> could trigger a firmware crash on some devices (including
>>>>> arrays).  The
>>>>> problem is actually that the system is usually groaning because
>>>>> it's out
>>>>> of memory within the controller.  That actually means that
>>>>> sending in
>>>>> another task (the abort) causes greater pressure.  For this
>>>>> reason, some
>>>>> device drivers choose to skip the abort step and head straight
>>>>> to reset.
>>>>> For reset, you guarantee that all outstanding tasks for the unit
>>>>> are
>>>>> killed.
>>>>
>>>> Hmm, I like this idea, given the finickiness of the abort path in
>>>> the uas
>>>> driver, and that:
>>>>
>>>> 1) We really have no proper way to test this
>>>> 2) We already have some known issues there (we don't kill sense
>>>> urbs atm,
>>>>     and I've a note somewhere about a double free on some corner
>>>> case
>>>>     where an urb submit fails)
>>>> 3)
>>>> 3) In all the cases where I've managed to trip op an uas device
>>>> the only
>>>>     thing which actually worked to recover things was doing a usb
>>>> reset
>>>> 4) Aborts should not happen in the first place, so using a big
>>>> hammer
>>>>     when they do is not really a big problem, instead we should
>>>> focus
>>>>     on figuring out why they happen and fix the cause
>>>>
>>>> I think that just dropping abort handling altogether is a good
>>>> idea actually,
>>>> so if there are no objections I'm just going to do that.
>>>>
>>>> I can simply not set eh_abort_handler (aka set it to NULL), right ?
>>>
>>> Just so you know, if you do this, error handling becomes much more
>>> painful.  The abort handler can now handle aborting and retrying
>>> without
>>> pausing the host.  The reset definitely stops the host and causes
>>> a big
>>> and noticeable burp in processing.  However, there are hosts which
>>> have
>>> to do it.
>>
>> I understand, but shouldn't aborts be something which rarely happens.
>> I guess that with a faulty drive they happen more often, but at this
>> point in time the uas driver's error handling problems are mostly with
>> dealing with timeouts during probing, which are likely caused by
>> the device not grokking some command we've send, and responding to
>> this in a bad manner (hanging mostly).
>>
>> So I'm tempted to just remove the abort handling, and first focus on
>> getting uas stable under all conditions. Once it is stable we can
>> look into making it deal more graciously with errors.
>>
> Sounds like a reasonable plan.
> 
> [ .. ]
>>>
>>> Wait, could this also be your abort problem?  In the running abort
>>> handler, we could get a scenario where we abort a bunch of
>>> commands at the same time.
>>
>> I don't think so, the uas code does not return from eh_abort_handler
>> until the abort has either succeeded or timedout (for which a 3 second
>> timeout is used). AFAIK the scsi core will always issue multiple aborts
>> sequentially, right. And if the abort succeeds then the tag is free
>> again for the next abort. Only if an abort fails (times out, which is
>> what I've seen in all the error conditions which I've encountered so
>> far),
>> then will subsequent aborts, and the logical unit reset, fail due to
>> there not being a free tag to use for them.
>>
> If the logic for command/task abort and logical unit reset is similar then 
> there is no point in implementing both.


The logic is different in that when an abort gets issued other commands may
(on paper) still complete normally, where as with a lun reset all commands
on that lun (and usually there is only one lun) are toast.

> So by what I've seen I would first implement target reset (and host reset :-)

With uas target == host, and that is already implemented and so far seems
to be the only thing which actually works (if bugs in eh_abort don't cause
an oops first).

Regards,

Hans
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Debugging scsi abort handling ?

Reply via email to