On Thu, 2017-03-02 at 16:45 +0200, Israel Rukshin wrote:
> The bug reproduce when unloading srp module with one port down.
> device_del() hangs when __scsi_remove_device() get scsi_device with
> state SDEV_OFFLINE or SDEV_TRANSPORT_OFFLINE.
> It hangs because device_del() is trying to send sync cache command
> when the device is offline but with SDEV_CANCEL status.
> The status was changed to SDEV_CANCEL by __scsi_remove_device()
> before it calls to device_del().

It is not device_del() but sd_shutdown() that submits the SYNCHRONIZE
CACHE command. It should also be explained in the commit message why
the block layer timeout mechanism did not cause the SYNCHRONIZE CACHE
commands to fail after the timeout expired.

> Fixes: e619e6cbe (Revert "scsi: Fix a bdi reregistration race")

This does not make sense. A revert is never the original introduction
of a bug but at most a reintroduction. Please refer to the commit that
originally introduced this hang.

> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -1282,6 +1282,8 @@ void __scsi_remove_device(struct scsi_device *sdev)
>               return;
>  
>       if (sdev->is_visible) {
> +             enum scsi_device_state oldstate = sdev->sdev_state;
> +
>               if (scsi_device_set_state(sdev, SDEV_CANCEL) != 0)
>                       return;
>  
> @@ -1289,6 +1291,15 @@ void __scsi_remove_device(struct scsi_device *sdev)
>               device_unregister(&sdev->sdev_dev);
>               transport_remove_device(dev);
>               scsi_dh_remove_device(sdev);
> +
> +             /*
> +              * Fail new requests if the old state was offline.
> +              * This avoids from device_del() to hang.
> +              */
> +             if (oldstate == SDEV_TRANSPORT_OFFLINE ||
> +                 oldstate == SDEV_OFFLINE)
> +                     blk_set_queue_dying(sdev->request_queue);
> +
>               device_del(dev);
>       } else
>               put_device(&sdev->sdev_dev);

Please refer to sd_shutdown() instead of device_del() in the above comment,
and also explain why the block layer timeout mechanism did not cause the
SYNCHRONIZE CACHE commands to fail.

Thanks,

Bart.

Reply via email to