Hi, On 08/28/2014 02:10 PM, Hannes Reinecke wrote: > On 08/26/2014 09:19 PM, Hans de Goede wrote: >> Hi, >> >> On 08/26/2014 08:34 PM, James Bottomley wrote: >>> On Tue, 2014-08-26 at 10:13 +0200, Hans de Goede wrote: >>>> Hi, >>>> >>>> On 08/25/2014 05:41 PM, James Bottomley wrote: >>>>> On Mon, 2014-08-25 at 13:39 +0200, Paolo Bonzini wrote: >>>>>> Il 25/08/2014 13:26, Hans de Goede ha scritto: >>>>>>> Thanks Bart and Paolo, your insights into this are greatly >>>>>>> appreciated. >>>>>>> >>>>>>> So with uas there are separate usb transaction for cmd, data >>>>>>> in, data out >>>>>>> and sense for each tag. At the time of abort, usually one of >>>>>>> data in / data >>>>>>> out and a sense usb transaction will be outstanding. >>>>>>> >>>>>>> There already is logic in the driver to kill the data in / out >>>>>>> transactions >>>>>>> if a sense gets returned (usually with an error) before they >>>>>>> are done. >>>>>>> >>>>>>> So if I'm reading this correctly, then on a successful abort, >>>>>>> the sense >>>>>>> transaction (if not already completed by the target) should be >>>>>>> cancelled as >>>>>>> it will never complete, correct ? >>>>>> >>>>>> Yes. More precisely, once the response IU comes back for the >>>>>> abort, >>>>>> there should be no more IUs for that task. Figure 13 >>>>>> ("Multiple Command >>>>>> Example") in the UAS spec actually shows that. >>>>>> >>>>>> At least, that's what it should be like in theory. I suspect >>>>>> firmware >>>>>> bugs will abound in this area, but at least you shouldn't be >>>>>> actively >>>>>> expecting an IU for an aborted task. >>>>> >>>>> Just a note on this. The abort area in all devices is where we >>>>> have >>>>> scope for spec compliance issues. Also in the old days abort >>>>> itself >>>>> could trigger a firmware crash on some devices (including >>>>> arrays). The >>>>> problem is actually that the system is usually groaning because >>>>> it's out >>>>> of memory within the controller. That actually means that >>>>> sending in >>>>> another task (the abort) causes greater pressure. For this >>>>> reason, some >>>>> device drivers choose to skip the abort step and head straight >>>>> to reset. >>>>> For reset, you guarantee that all outstanding tasks for the unit >>>>> are >>>>> killed. >>>> >>>> Hmm, I like this idea, given the finickiness of the abort path in >>>> the uas >>>> driver, and that: >>>> >>>> 1) We really have no proper way to test this >>>> 2) We already have some known issues there (we don't kill sense >>>> urbs atm, >>>> and I've a note somewhere about a double free on some corner >>>> case >>>> where an urb submit fails) >>>> 3) >>>> 3) In all the cases where I've managed to trip op an uas device >>>> the only >>>> thing which actually worked to recover things was doing a usb >>>> reset >>>> 4) Aborts should not happen in the first place, so using a big >>>> hammer >>>> when they do is not really a big problem, instead we should >>>> focus >>>> on figuring out why they happen and fix the cause >>>> >>>> I think that just dropping abort handling altogether is a good >>>> idea actually, >>>> so if there are no objections I'm just going to do that. >>>> >>>> I can simply not set eh_abort_handler (aka set it to NULL), right ? >>> >>> Just so you know, if you do this, error handling becomes much more >>> painful. The abort handler can now handle aborting and retrying >>> without >>> pausing the host. The reset definitely stops the host and causes >>> a big >>> and noticeable burp in processing. However, there are hosts which >>> have >>> to do it. >> >> I understand, but shouldn't aborts be something which rarely happens. >> I guess that with a faulty drive they happen more often, but at this >> point in time the uas driver's error handling problems are mostly with >> dealing with timeouts during probing, which are likely caused by >> the device not grokking some command we've send, and responding to >> this in a bad manner (hanging mostly). >> >> So I'm tempted to just remove the abort handling, and first focus on >> getting uas stable under all conditions. Once it is stable we can >> look into making it deal more graciously with errors. >> > Sounds like a reasonable plan. > > [ .. ] >>> >>> Wait, could this also be your abort problem? In the running abort >>> handler, we could get a scenario where we abort a bunch of >>> commands at the same time. >> >> I don't think so, the uas code does not return from eh_abort_handler >> until the abort has either succeeded or timedout (for which a 3 second >> timeout is used). AFAIK the scsi core will always issue multiple aborts >> sequentially, right. And if the abort succeeds then the tag is free >> again for the next abort. Only if an abort fails (times out, which is >> what I've seen in all the error conditions which I've encountered so >> far), >> then will subsequent aborts, and the logical unit reset, fail due to >> there not being a free tag to use for them. >> > If the logic for command/task abort and logical unit reset is similar then > there is no point in implementing both.
The logic is different in that when an abort gets issued other commands may (on paper) still complete normally, where as with a lun reset all commands on that lun (and usually there is only one lun) are toast. > So by what I've seen I would first implement target reset (and host reset :-) With uas target == host, and that is already implemented and so far seems to be the only thing which actually works (if bugs in eh_abort don't cause an oops first). Regards, Hans -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html