On Wed, 19 Mar 2014, Andreas Reis wrote:
> I've uploaded a dmesg with the new debugging patch to bugzilla:
> https://bugzilla.kernel.org/attachment.cgi?id=130041

Thanks. I have now managed to reproduce many of the features of this
problem on my own computer.

James, I will need your help (or help from somebody who understands the
SCSI error handler) to figure out how this problem should be fixed.

Basically, usb-storage deadlocks when the SCSI error handler invokes
the eh_device_reset_handler callback while a command is running. The
command has timed out and will never complete normally, because the
device's firmware has crashed. But usb-storage's device-reset routine
waits for the current command to finish, which brings everything to a
standstill.

Is this design wrong? That is, should the device-reset routine wait
for currently executing commands to finish, or should it abort them, or
what?

Or should the SCSI error handler abort the running command before
invoking the eh_device_reset_handler callback?
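
To make the deadlock and the alternatives in the questions above concrete,
here is a rough userspace model. It is only a sketch, not kernel code:
every name in it (model_dev, model_tick, model_reset_waiting,
model_reset_aborting) is hypothetical, and the max_ticks bound is an
assumption added purely so that the model terminates where the real
wait would not.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one storage device; none of these are kernel APIs. */
struct model_dev {
	bool cmd_running;	/* a command currently owns the device */
	bool firmware_alive;	/* false once the device firmware has crashed */
};

/* Advance one tick: a running command completes only if the firmware responds. */
static void model_tick(struct model_dev *d)
{
	if (d->cmd_running && d->firmware_alive)
		d->cmd_running = false;
}

/*
 * Reset routine that waits for the running command before resetting.
 * The real wait is unbounded; max_ticks exists only so the model ends.
 * Returns true if the reset ran, false if it was still waiting at the limit.
 */
static bool model_reset_waiting(struct model_dev *d, int max_ticks)
{
	assert(max_ticks >= 0);

	for (int t = 0; t < max_ticks && d->cmd_running; t++)
		model_tick(d);

	if (d->cmd_running)
		return false;		/* the real driver would still be blocked */

	d->firmware_alive = true;	/* the reset revives the device */
	return true;
}

/* Alternative ordering: give up on the running command, then reset. */
static bool model_reset_aborting(struct model_dev *d)
{
	d->cmd_running = false;		/* abort the stuck command */
	d->firmware_alive = true;	/* the reset revives the device */
	return true;
}
```

With cmd_running set and firmware_alive clear, model_reset_waiting()
never gets to the reset no matter how large max_ticks is, while
model_reset_aborting() does. The model says nothing about which layer
ought to perform the abort, which is exactly the open question.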

For the record, and in case anyone is curious, here's the detailed
sequence of events during my test:

    sd issues a READ(10) command. For whatever reason, the device
    goes nuts and the command times out.

    scsi_times_out() calls scsi_abort_command(), which queues an
    abort request.

    scmd_eh_abort_handler() calls scsi_try_to_abort_cmd(), which
    succeeds in aborting the READ.

    The READ command is retried (I didn't trace through the details
    of this). The retry fails with a Unit Attention (SK=6,
    ASC=0x29, Reset or Bus Device Reset Occurred).

    The READ command is retried a second time, and it times out
    again.

    This time around, scsi_times_out() calls scsi_abort_command()
    unsuccessfully (because the SCSI_EH_ABORT_SCHEDULED flag is
    still set).

    As a result, scsi_error_handler() calls scsi_unjam_host(),
    which calls scsi_eh_get_sense().

    That routine calls scsi_request_sense(), which goes into
    scsi_send_eh_cmnd().

    The calls to shost->hostt->queuecommand() all fail, because the
    READ command is still running and usb-storage has a queue
    depth of 1. The error messages produced by these failures are
    disconcerting but not dangerous.

    Since the REQUEST SENSE command was never issued,
    scsi_eh_get_sense() returns 0.

    scsi_unjam_host() goes on to call scsi_eh_abort_cmds(), which
    does essentially nothing because the SCSI_EH_CANCEL_CMD flag
    for the only command on work_q is clear.

    scsi_eh_test_devices() returns 0 because check_list is empty
    and work_q isn't.

    scsi_unjam_host() then calls scsi_eh_ready_devs(). This
    routine ends up calling scsi_eh_bus_device_reset(), at which
    point usb-storage deadlocks as described above.

(On Andreas's system, the first READ retry times out as opposed to the
second retry as on my computer. I doubt this makes any difference.)
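
The escalation in the sequence above can be summarized in a small
userspace model. This is a sketch under stated assumptions, not the
kernel's implementation: all model_* names and MODEL_ABORT_SCHEDULED are
hypothetical stand-ins, each recovery step is reduced to a boolean
meaning "nothing left to recover", and the model does not try to
reproduce when the kernel sets or clears its real flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag modeled on SCSI_EH_ABORT_SCHEDULED; the value is arbitrary. */
#define MODEL_ABORT_SCHEDULED 0x1u

enum model_action { MODEL_ASYNC_ABORT, MODEL_WAKE_EH };

struct model_cmd {
	unsigned int eh_flags;
};

struct model_host {
	int in_flight;		/* commands currently owned by the driver */
	int queue_depth;	/* 1 for the usb-storage case described above */
	char trace[128];	/* order in which the recovery steps ran */
};

/* A timeout schedules an abort, unless one was already scheduled earlier. */
static enum model_action model_times_out(struct model_cmd *c)
{
	if (c->eh_flags & MODEL_ABORT_SCHEDULED)
		return MODEL_WAKE_EH;	/* abort refused, error handler takes over */
	c->eh_flags |= MODEL_ABORT_SCHEDULED;
	return MODEL_ASYNC_ABORT;
}

/* Append one step name to the trace, separated by " > ". */
static void model_trace(struct model_host *h, const char *step)
{
	size_t used = strlen(h->trace);

	snprintf(h->trace + used, sizeof(h->trace) - used, "%s%s",
		 used ? " > " : "", step);
}

/* Queueing fails while the host is already at its queue depth. */
static bool model_queuecommand(struct model_host *h)
{
	assert(h->queue_depth > 0);
	if (h->in_flight >= h->queue_depth)
		return false;
	h->in_flight++;
	return true;
}

/* REQUEST SENSE step: in this model it never counts as a full recovery. */
static bool model_get_sense(struct model_host *h)
{
	if (!model_queuecommand(h)) {
		model_trace(h, "get_sense(busy)");	/* never issued */
		return false;
	}
	h->in_flight--;		/* the sense command completes at once here */
	model_trace(h, "get_sense");
	return false;
}

/* Abort step: a no-op, since no command is flagged for cancellation. */
static bool model_abort_cmds(struct model_host *h)
{
	model_trace(h, "abort_cmds");
	return false;
}

/* Device reset step: cannot finish while the timed-out command is in flight. */
static bool model_device_reset(struct model_host *h)
{
	model_trace(h, "device_reset");
	return h->in_flight == 0;
}

/* Recovery ladder: each step runs only if the previous one left work behind. */
static bool model_unjam_host(struct model_host *h)
{
	if (model_get_sense(h))
		return true;
	if (model_abort_cmds(h))
		return true;
	return model_device_reset(h);
}
```

Calling model_times_out() twice on the same command yields
MODEL_ASYNC_ABORT and then MODEL_WAKE_EH. Running model_unjam_host() on a
host with in_flight = 1 and queue_depth = 1 walks all three steps and
still reports failure, which corresponds to the point where the real
driver hangs instead of returning.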
I can't tell if this is all working as intended or if it went off the
tracks somewhere.
Thanks for any guidance.
Alan Stern