Re: [PATCH 1/5] SCSI scanning and removal fixes

Luben Tuikov Thu, 08 Sep 2005 16:59:55 -0700

On 09/08/05 14:36, Alan Stern wrote:
> On Thu, 8 Sep 2005, Luben Tuikov wrote:
> 
> 
>>>No.  There are only two reset mechanisms available for USB Mass Storage.  
>>
>>I see.
>>
>>
>>>One is a class-specific reset command (which usb-storage issues when asked 
>>>to do a device reset),
>>
>>So as far as I can see it sends a reset on the wire to the device?
> 
> 
> Sort of.  It's not a low-level hardware reset, it's rather high-level.  
> If the device's firmware is busy doing something else, or is hung, or
> doesn't implement resets correctly, then the reset won't work.


So this is perfect for LU Reset.

>>>and the other is a USB port reset (which 
>>>usb-storage issues when asked to do a bus reset).
>>
>>As far as I see this resets the port on the host side and then
>>rediscovers the devices?
> 
> 
> That's right.  (Except there's only one device to rediscover.)  This is a 
> much lower-level action than the other sort, and is much more likely to 
> succeed.

So this is perfect for I_T Nexus Reset.

>>>Well, usb-storage doesn't need to "remove" a failed device.  The SCSI core 
>>>does a perfectly good job just by setting it off-line.  And since there 
>>>aren't other targets sharing the bus, that's good enough.
>>
>>Well what happens when I just unplug the device?
> 
> 
> For a brief time (maybe 1/4 second) all commands will return DID_ERROR
> and resets will return FAILED.  Then usbcore will realize that the device

Perfectly ok.

> has been unplugged and will call the usb-storage disconnect method, which
> in turn will call scsi_remove_host.

This is fine.  I haven't looked closely at the code, but note that
scsi_remove_host() should absolutely be called last.
 
>>What code?  What reinvention?
> 
> scsi_unjam_host plus the things that it calls.  Why add a private version 
> of all that to usb-storage when it already exists in scsi_error.c?

Because it is not implemented correctly.

Think about this:
        EH is function of scsi_host.
        You can define YOUR OWN EH! in usb storage, which you control!
        So that when you want to remove the host you know what position
        you're at. ;-)

>>You _need_ to implement a timeout and eh hook if you want to do this
>>in any sane way.
> 
> 
> No I don't.  There already is a timeout in the SCSI core
> (scmd->eh_timeout); I don't need to add another one.

Eh, you're missing my point completely.
Hopefully one day you can see what I mean.

>>You also need to implement kref via kobject for the devices which you/USB 
>>Storage
>>represent to SCSI Core.
> 
> Again, I don't need to.  The data used by usb-storage belongs to the
> private portion of the scsi_host, so it's already protected by the host's
> kref.

Well then, if your code is so correct, why are all these problems?

>> You also need to get rid of that horrible patchwork
>>of a solution: dev_semaphore.  Maybe you can take a look in the SAS code and
>>see how this is all done?
> 
> I assume you're referring to the way dev_semaphore is used in the 
> bus_reset routine?  It's sort of a preventative measure, because the SCSI 
> core and the driver core haven't quite settled down yet.

"Settled down" -- you mean, litte design if ever has been done writing this
thing?  Or do you mean: "one is waiting for the other to change in certain
ways to make it work"?

> Here's the idea.  We want to protect against a race between the error 
> handler doing a bus reset and the user doing rmmod usb-storage.  With the 
> current code, I believe scsi_remove_host does not wait for devices to go 
> out of error recovery.  Here is what might happen without dev_semaphore:
> 
>       Error handler thread            Rmmod thread
>       --------------------            ------------
>       eh calls usb-storage's bus_reset
>       -> calls usb_stor_port_reset
>        -> calls usb_lock_device_for_reset
>       -> calls usb_reset_device
>                                       Driver core calls usb-storage's
>                                               remove method
>                                       storage_disconnect calls
>                                               quiesce_and_remove_host
>                                       -> locks & releases dev_semaphore (*)
>                                       -> calls scsi_remove_host
>                                               (doesn't wait for eh)
>                                       -> returns to driver core

How can all this happen if a kref (module count) is upped by one,
due to SCSI Core using usb-storage?

>        -> usb_reset_device does
>               the port reset
> 
> (This picture is over-simplified because the driver core does some locking 
> of its own before calling the remove method.  Let's not worry about that.)
> 
> Now see what has happened.  usb-storage's remove method has returned, so
> it has given up all claims to the device.  It shouldn't do anything more
> that will affect the device.  And yet in the eh thread it's about to do a
> port reset!

Yes, this is because eh should be in USB Storage.

> 
> Since in reality the bus_reset routine acquires dev_semaphore, the rmmod
> thread will wait at (*) until the reset has finished.  Alternatively, if

OMG!

> scsi_remove_host waited until the device was out of error recovery, then
> we would be okay without using dev_semaphore.
> 
> The complication exists inside usb_lock_device_for_reset.  That routine
> ends up calling down_trylock and looping with a timeout, because of the 
> potential for a complicated deadlock that I have described before.

OMG!
 
> There's one more thing worth mentioning.  Thanks to the resets carried out
> internally by usb-storage, we may decide it's easiest not to implement the
> bus_reset method at all!  That would certainly solve all the problems with
> dev_semaphore.  :-)

What would solve your problem is implementing your own eh and timer hook.
More so for transports such as USB.

        Luben

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/5] SCSI scanning and removal fixes

Reply via email to