On Thu, 2014-03-20 at 11:36 -0400, Alan Stern wrote:
> On Wed, 19 Mar 2014, James Bottomley wrote:
>
> > > Basically, usb-storage deadlocks when the SCSI error handler invokes
> > > the eh_device_reset_handler callback while a command is running. The
> > > command has timed out and will never complete normally, because the
> > > device's firmware has crashed. But usb-storage's device-reset routine
> > > waits for the current command to finish, which brings everything to a
> > > standstill.
> > >
> > > Is this design wrong? That is, should the device-reset routine wait
> > > for currently executing commands to finish, or should it abort them, or
> > > what?
> >
> > In some sense, yes, but not necessarily from the Point of View of USB.
> > What we assume in SCSI is that commands are forgettable, meaning there's
> > always some operation we can perform (be it abort or reset) that causes
> > the device to forget about all outstanding commands and reset its
> > internal state machine to a known good state.
> >
> > The cardinal SCSI assumption is that after we've successfully done an
> > abort or reset on a command that it will never come back to us from the
> > device.
>
> The same assumption is present in usb-storage.
>
> The difference appears to be that SCSI assumes a reset can be issued at
> any time, even while a command is running. In usb-storage, that's true
> for a _bus_ reset but it's not true for a _device_ reset.
>
> Perhaps this restriction on device resets could be lifted, but in real
> life it probably won't make much difference. The fact is, almost no
> USB mass-storage devices implement device reset correctly, whereas most
> of them do implement bus reset. (Possibly related factoid: MS Windows
> uses bus resets but doesn't use device resets in its USB mass-storage
> driver.)
Actually, you don't have to lift the restriction. Just fail the device
reset if you can't issue it (so if there's any outstanding commands).
SCSI will move on to a bus reset which you can accept.
> > > Or should the SCSI error handler abort the running command before
> > > invoking the eh_device_reset_handler callback?
> >
> > So this is rooted in the "Abort can be a Problem" issue: Abort
> > sometimes works well (and it's not very disruptive) but sometimes if the
> > device is having a problem in its command state machine, adding another
> > command (which is what the abort is) doesn't actually do anything, so we
> > need error escalation to reset. We can't wait for the abort or other
> > commands to complete because they never will. The reset is expected to
> > clear everything from the device (including the pending aborts).
>
> With usb-storage, aborts usually work pretty well. But sometimes they
> don't cure the underlying cause, so when the offending command is
> retried, it hangs up again. Something like that seems to be happening
> here.
>
> > > For the record, and in case anyone is curious, here's the detailed
> > > sequence of events during my test:
> > >
> > > sd issues a READ(10) command. For whatever reason, the device
> > > goes nuts and the command times out.
> > >
> > > scsi_times_out() calls scsi_abort_command(), which queues an
> > > abort request.
> > >
> > > scmd_eh_abort_handler() calls scsi_try_to_abort_cmd(), which
> > > succeeds in aborting the READ.
> > >
> > > The READ command is retried (I didn't trace through the details
> > > of this). The retry fails with a Unit Attention (SK=6,
> > > ASC=0x29, Reset or Bus Device Reset Occurred).
> > >
> > > The READ command is retried a second time, and it times out
> > > again.
> > >
> > > This time around, scsi_times_out() calls scsi_abort_command()
> > > unsuccessfully (because the SCSI_EH_ABORT_SCHEDULED flag is
> > > still set).
> >
> > From the first time we sent the abort?
>
> Yes. As far as I can see, the only place where SCSI_EH_ABORT_SCHEDULED
> gets cleared is in scsi_eh_finish_cmd(), which never got called.
> After the successful abort, scmd_eh_abort_handler() went directly into
> its "if (rtn == SUCCESS)" and "scmd %p retry aborted command" /
> scsi_queue_insert() code paths.
OK, that's a bug ... I'll see if I can find it.
> > That sounds like a problem in
> > our state tracking.
>
> Could well be. How should I track it down further? Or can you suggest
> a patch just from this much information?
>
> > > As a result, scsi_error_handler() calls scsi_unjam_host(),
> > > which calls scsi_eh_get_sense().
> > >
> > > That routine calls scsi_request_sense(), which goes into
> > > scsi_send_eh_cmnd().
> >
> > I thought USB was autosense, so when it reports check condition, we
> > should already have sense ... or are we calling request_sense without
> > being sent a check condition status?
>
> usb-storage does autosense. As far as I can tell, the REQUEST SENSE is
> issued because there is no sense data, which is because the READ
> command is still running. Is this another state-tracking bug?
Yes ... we should only issue a request sense if we got a check condition
return. If we're doing it without that, something is wrong somewhere.
> > > The calls to shost->hostt->queuecommand() all fail, because the
> > > READ command is still running and usb-storage has a queue
> > > depth of 1. The error messages produced by these failures are
> > > disconcerting but not dangerous.
> > >
> > > Since the REQUEST SENSE command was never issued,
> > > scsi_eh_get_sense() returns 0.
> > >
> > > scsi_unjam_host() goes on to call scsi_eh_abort_cmds(), which
> > > does essentially nothing because the SCSI_EH_CANCEL_CMD flag
> > > for the only command on work_q is clear.
> > > scsi_eh_test_devices() returns 0 because check_list is empty
> > > and work_q isn't.
> > >
> > > scsi_unjam_host() then calls scsi_eh_ready_devs(). This
> > > routine ends up calling scsi_eh_bus_device_reset(), at which
> > > point usb-storage deadlocks as described above.
> >
> > OK, so in the case where the command can never complete (because the fw
> > has crashed), what should be the process for resetting the device so it
> > can again function?
>
> Simply abort the command. usb-storage will automatically carry out a
> bus reset under those conditions, and that's the best we can do.
>
> As you can see, the sequence above started out doing exactly this.
> The problems began when the command was retried. That time it didn't
> get aborted, thanks to the leftover SCSI_EH_ABORT_SCHEDULED flag.
> Things got worse from there.
OK, so I think we have three things to do
1. Investigate SCSI and fix it's abort state problem that's causing
it not to send the abort second time around
2. Fix usb-storage to fail a reset it can't do (i.e. device reset
with outstanding commands)
3. Find out why we're sending a spurious request sense.
I can look at 1 and 3 if you want to take 2.
James
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html