Re: [PATCH 09/15] limpathpersist: redesign failed release workaround

Benjamin Marzinski Tue, 26 Aug 2025 14:07:55 -0700

On Tue, Aug 26, 2025 at 12:06:50PM +0200, Martin Wilck wrote:
> On Tue, 2025-08-26 at 10:44 +0200, Martin Wilck wrote:
> > On Mon, 2025-08-25 at 20:51 -0400, Benjamin Marzinski wrote:
> > > On Sun, Aug 24, 2025 at 05:26:50PM +0200, Martin Wilck wrote:
> > > 
> > > > 
> > > > > +     /*
> > > > > +      * Cannot free the reservation because the path that
> > > > > is
> > > > > holding it
> > > > > +      * is not usable. Workaround this by:
> > > > > +      * 1. Suspending the device
> > > > > +      * 2. Preempting the reservation to move it to a
> > > > > usable
> > > > > path
> > > > > +      *    (this removes the registered keys on all paths
> > > > > except
> > > > > the
> > > > > +      *    preempting one. Since the device is suspended,
> > > > > no
> > > > > IO
> > > > > can
> > > > > +      *    go to these unregistered paths and fail).
> > > > > +      * 3. Releasing the reservation on the path that now
> > > > > holds
> > > > > it.
> > > > > +      * 4. Resuming the device (since it no longer matters
> > > > > that
> > > > > most of
> > > > > +      *    that paths no longer have a registered key)
> > > > > +      * 5. Reregistering keys on all the paths
> > > > > +      */
> > > > > +
> > > > > +     if (!dm_simplecmd_noflush(DM_DEVICE_SUSPEND, mpp-
> > > > > >alias,
> > > > > 0))
> > > > > {
> > > > > +             condlog(0, "%s: release: failed to suspend dm
> > > > > device.",
> > > > 
> > > > Why do you use dm_simplecmd_noflush() here? Shouldn't queued IO
> > > > be
> > > > flushed from the dm device to avoid it being sent to paths that
> > > > are
> > > > going to be unregistered?
> > > > 
> > > 
> > > I'm pretty certain that DM will still flush all the IO from the
> > > target
> > > to DM core before suspending, even with dm_simplecmd_noflush() set.
> > > In
> > > request based multipath, queued IOs are never stored in the target.
> > > In
> > > bio based multipath, they are, but they will get flushed back up to
> > > DM
> > > core when suspending and queued there. No IO should happen through
> > > the
> > > target after the suspend, until the resume. dm_simplecmd_noflush()
> > > just
> > > keeps multipath from failing any IO that it had queueing, and it's
> > > only
> > > really necessary when we resize the device, because if we shrink
> > > the
> > > device, outstanding IO might be outside the new bounds.
> > 
> > OK, thanks for the clarification. I guess I've never fully understood
> > the way queueing works in dm.
> > 
> > What about queueing in the path devices? We'll be removing
> > registration
> > keys, so IO sent by the SCSI layer may end up with RESERVATION
> > CONFLICT
> > errors. To my understanding, without the DM_NOFLUSH_FLAG the kernel
> > will freeze the queue and flush everything, as if the device was
> > closed
> > during shutdown. If DM_NOFLUSH_FLAG is set, this won't happen. What's
> > preventing the SCSI layer from sending IO while we're modifying the
> > registrations?
> 
> I just re-read commit 2e93ccc1933d ("[PATCH] dm: suspend: add noflush
> pushback"), which explains that the noflush flag is mainly meant for
> handling queue_if_no_path situations, in particular being able to
> reload a queueing multipath map with new paths when all paths have
> failed. That's exactly what you wrote, queueing IO in generic dm rather
> than in the target in order to facilitate map reloads.
> 
> Here we're looking at a very different situation. RESERVATION CONFLICT
> errors won't cause queueing, because the kernel doesn't classify them
> as path errors.


True. But to be pendantic, since we're sending the PR commands directly
to the scsi devices and not through DM, multipath's queueing can't
effect them or be effected by them. DM won't ever see those ioctls or
any errors that occur from them.

> If we need to preempt a reservation held by another
> host, we'd rather not send IO to the device yet. If we have a
> reservation we want to give up, we should make sure to have all
> outstanding IO sent to the storage beforehand. In other situations like
> just registering keys without reservations being present, flushing
> won't be necessary, but it won't hurt either, AFAICS.

The only situations where we will do the suspend is when a multipath
device preempts the reservation key that it is holding, either because
it was explicitly told to (which is something that the SCSI spec
specifically calls out as valid, and is in-fact the only way to change
the reservation type of an existing reservation) or because we failed to
release a reservation, since the path holding the reservation is down.

If we do a flushing suspend, all queued IO will be failed, regardless of
the user's no_path_retry setting. We really don't want to do that unless
we have to (like when we're shrinking the device).

If we still have usable paths to the storage, that IO will get completed
before we suspend. The noflush code only matters if we don't have usable
paths. I would argue that even if we are releasing a reservation, we
want to continue queueing the IO if we end up in a queueing situation.
After we release a reservation, nobody is holding one, and all IO is
allowed. Plus, our paths still have keys registered (aside from that gap
when they are suspended) if another device grabs the reservation. So
there's no point in giving up on the queued IO when we release a
reservation, since we have every reason to believe it will be able to go
through when a path gets restored. 

> I am not sure whether it makes sense to try sending PRIN or PROUT
> commands to maps in queueing state. Depending on the nature of the
> errors that caused the paths to fail, we may or may not be able to send
> this type of commands. PRIN/PROUT must still work for devices in
> STANDBY state, but there are other situations where they wouldn't work,
> or would even hang.
> 
> I tend to think that sending PRIN or PROUT commands to queueing devices
> is an extreme corner case. Perhaps we should just refuse to do this 
> (realizing that we can't avoid a device entering queueing mode while
> we're processing PR commands, but that's an even more extreme corner
> case).

libmpathpersist runs the path checkers before it sends any commands and
doesn't send PRIN or PROUT commands to paths that are down. If there are
no usable paths, libmpathpersist will just return failure. I have no
plans to change any of this. Sure, it's unlikely that we will enter the
queueing state later, while processing the libmpathpersist command. But
the code doesn't get any simpler by doing a flushing suspend instead of
a noflush suspend, and we make sure that we won't fail IOs when
multipath is configured not to, even in this corner case. There are
plenty of places in the multipath code where we suspend (or reload a
table, where we implicitly suspend) knowing there are usable devices,
and we do noflush suspends anyways. I don't see why this is any
different.

And the window where we lose our last path while mpathpersist is running
is small but not super-small, especially in the case where we were told
to PREEMPT our key. libmpathpersist runs the path checkers pretty early.
There is still a bunch of setup and checking stuff it needs to do before
it actually gets around to suspending in preparation for a device
preempting its own key. So if the last path failed between when the path
checkers were run and when we suspend, we could have queued IO. Users
running a PREEMPT or a RELEASE command shouldn't expect it to cause
queued IO to fail if all the paths go down, and I don't see any big
benefit to offset the downside of allowing that possibility.

> 
> If we can assume that the device is not queueing, it might actually be
> better leave the NOFLUSH flag clear, making sure that the queue is 
> frozen and all IO is actually flushed before changing PR keys or
> reservations.

So, I'm assuming that you are talking about the lock_fs() call in
__dm_suspend() that gets skipped when we do a noflush suspend. This is
the only difference I can see that isn't just us opening ourselves up to
unwanted IO errors. And this admittedly will cause the fs to flush IO to
the device, which would be a good thing if we were about to lose our
ability to write to the disk. I have two arguments against caring about
this.

1. sg_persist doesn't force the filesystem to quiesce, so why should
mpathpersist? Nothing that is doing persistent reservations should rely
on that happening when they change a reservation.

2. We won't be losing our ability to write to the disk. Like I said,
there are only two cases where we try to preempt ourselves and trigger
this suspend: either we were explicitly told to preempt ourselves, or we
failed doing a release because the path holding the reservation was
down. In both cases, we keep our registered keys.

The only Persistent reservation types where a path with a registered key
can't do any IO it wants are types 1 (write exclusive) and 3 (exclusive
access), and we should probably disallow those types in libmpathpersist,
because they will never work. With those types of reservations, only the
I_T Nexus holding the reservation is allowed to do IO. This means that
only one path of a multipath device will ever be able to do IO. The
others will fail with RESEVATION CONFLICT errors. 

-Ben

> Am I getting something wrong here?
> 
> Martin

Re: [PATCH 09/15] limpathpersist: redesign failed release workaround

Reply via email to