On Sat, 2018-08-04 at 18:01 +0800, Ming Lei wrote:
> On Sat, Aug 4, 2018 at 8:03 AM, Bart Van Assche <[email protected]>
> wrote:
> >
> > diff --git a/block/blk-pm.c b/block/blk-pm.c
> > index 2a4632d0be4b..0708185ead61 100644
> > --- a/block/blk-pm.c
> > +++ b/block/blk-pm.c
> > @@ -107,17 +107,31 @@ int blk_pre_runtime_suspend(struct request_queue *q)
> > if (!q->dev)
> > return ret;
> >
> > + WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
> > +
> > blk_pm_runtime_lock(q);
> > + /*
> > + * Switch to preempt-only mode before calling synchronize_rcu() such
> > + * that later blk_queue_enter() calls see the preempt-only state.
> > See
> > + * also http://lwn.net/Articles/573497/.
> > + */
> > + blk_set_pm_only(q);
> > + ret = -EBUSY;
> > + if (percpu_ref_read(&q->q_usage_counter) == 1) {
> > + synchronize_rcu();
> > + if (percpu_ref_read(&q->q_usage_counter) == 1)
> > + ret = 0;
> > + }
>
> The above code looks too tricky, and the following issue might exist in case
> of potential WRITEs re-order from blk_queue_enter().
>
> 1) suppose there are one in-flight request not completed yet
>
> 2) observer is the percpu_ref_read() following synchronize_rcu().
>
> 3) inside blk_queue_enter(), WRITE in
> percpu_ref_tryget_live()/percpu_ref_put()
> can be re-ordered from view of the observer in 2)
>
> 4) then percpu_ref_read() may return 1 even though the only in-flight
> request isn't completed
>
> 5) suspend still moves on, and data loss may be triggered
>
> Cc Paul, and Alan have been CCed already, so we have several memory
> model experts here, :-), hope they may help to clarify if this reorder case
> may exist.
Hello Ming,
There are two cases:
(a) The percpu_ref_tryget_live() call in (3) is triggered by a runtime power
state transition (RPM_*).
(b) The percpu_ref_tryget_live() call in (3) is not triggered by a runtime
power state transition.
In case (b) the driver that submits the request should protect the request
with scsi_autopm_get_device() / scsi_autopm_put_device() or by any other
mechanism that prevents runtime power state transitions. Since
scsi_autopm_get_device() is called before blk_queue_enter(), since
scsi_autopm_get_device() only returns after it has resumed a device and since
scsi_autopm_get_device() prevents runtime suspend, the function
blk_pre_runtime_suspend() won't be called concurrently with blk_queue_enter()
in case (b).
Since the power management serializes power state transitions also for case
(a) the race described above won't happen. blk_pre_runtime_suspend() will have
finished before any RQF_PM requests are submitted.
Bart.