On Tue, Aug 26, 2025 at 12:06:50PM +0200, Martin Wilck wrote: > On Tue, 2025-08-26 at 10:44 +0200, Martin Wilck wrote: > > On Mon, 2025-08-25 at 20:51 -0400, Benjamin Marzinski wrote: > > > On Sun, Aug 24, 2025 at 05:26:50PM +0200, Martin Wilck wrote: > > > > > > > > > > > > + /* > > > > > + * Cannot free the reservation because the path that > > > > > is > > > > > holding it > > > > > + * is not usable. Workaround this by: > > > > > + * 1. Suspending the device > > > > > + * 2. Preempting the reservation to move it to a > > > > > usable > > > > > path > > > > > + * (this removes the registered keys on all paths > > > > > except > > > > > the > > > > > + * preempting one. Since the device is suspended, > > > > > no > > > > > IO > > > > > can > > > > > + * go to these unregistered paths and fail). > > > > > + * 3. Releasing the reservation on the path that now > > > > > holds > > > > > it. > > > > > + * 4. Resuming the device (since it no longer matters > > > > > that > > > > > most of > > > > > + * that paths no longer have a registered key) > > > > > + * 5. Reregistering keys on all the paths > > > > > + */ > > > > > + > > > > > + if (!dm_simplecmd_noflush(DM_DEVICE_SUSPEND, mpp- > > > > > >alias, > > > > > 0)) > > > > > { > > > > > + condlog(0, "%s: release: failed to suspend dm > > > > > device.", > > > > > > > > Why do you use dm_simplecmd_noflush() here? Shouldn't queued IO > > > > be > > > > flushed from the dm device to avoid it being sent to paths that > > > > are > > > > going to be unregistered? > > > > > > > > > > I'm pretty certain that DM will still flush all the IO from the > > > target > > > to DM core before suspending, even with dm_simplecmd_noflush() set. > > > In > > > request based multipath, queued IOs are never stored in the target. > > > In > > > bio based multipath, they are, but they will get flushed back up to > > > DM > > > core when suspending and queued there. No IO should happen through > > > the > > > target after the suspend, until the resume. dm_simplecmd_noflush() > > > just > > > keeps multipath from failing any IO that it had queueing, and it's > > > only > > > really necessary when we resize the device, because if we shrink > > > the > > > device, outstanding IO might be outside the new bounds. > > > > OK, thanks for the clarification. I guess I've never fully understood > > the way queueing works in dm. > > > > What about queueing in the path devices? We'll be removing > > registration > > keys, so IO sent by the SCSI layer may end up with RESERVATION > > CONFLICT > > errors. To my understanding, without the DM_NOFLUSH_FLAG the kernel > > will freeze the queue and flush everything, as if the device was > > closed > > during shutdown. If DM_NOFLUSH_FLAG is set, this won't happen. What's > > preventing the SCSI layer from sending IO while we're modifying the > > registrations? > > I just re-read commit 2e93ccc1933d ("[PATCH] dm: suspend: add noflush > pushback"), which explains that the noflush flag is mainly meant for > handling queue_if_no_path situations, in particular being able to > reload a queueing multipath map with new paths when all paths have > failed. That's exactly what you wrote, queueing IO in generic dm rather > than in the target in order to facilitate map reloads. > > Here we're looking at a very different situation. RESERVATION CONFLICT > errors won't cause queueing, because the kernel doesn't classify them > as path errors.
True. But to be pendantic, since we're sending the PR commands directly to the scsi devices and not through DM, multipath's queueing can't effect them or be effected by them. DM won't ever see those ioctls or any errors that occur from them. > If we need to preempt a reservation held by another > host, we'd rather not send IO to the device yet. If we have a > reservation we want to give up, we should make sure to have all > outstanding IO sent to the storage beforehand. In other situations like > just registering keys without reservations being present, flushing > won't be necessary, but it won't hurt either, AFAICS. The only situations where we will do the suspend is when a multipath device preempts the reservation key that it is holding, either because it was explicitly told to (which is something that the SCSI spec specifically calls out as valid, and is in-fact the only way to change the reservation type of an existing reservation) or because we failed to release a reservation, since the path holding the reservation is down. If we do a flushing suspend, all queued IO will be failed, regardless of the user's no_path_retry setting. We really don't want to do that unless we have to (like when we're shrinking the device). If we still have usable paths to the storage, that IO will get completed before we suspend. The noflush code only matters if we don't have usable paths. I would argue that even if we are releasing a reservation, we want to continue queueing the IO if we end up in a queueing situation. After we release a reservation, nobody is holding one, and all IO is allowed. Plus, our paths still have keys registered (aside from that gap when they are suspended) if another device grabs the reservation. So there's no point in giving up on the queued IO when we release a reservation, since we have every reason to believe it will be able to go through when a path gets restored. > I am not sure whether it makes sense to try sending PRIN or PROUT > commands to maps in queueing state. Depending on the nature of the > errors that caused the paths to fail, we may or may not be able to send > this type of commands. PRIN/PROUT must still work for devices in > STANDBY state, but there are other situations where they wouldn't work, > or would even hang. > > I tend to think that sending PRIN or PROUT commands to queueing devices > is an extreme corner case. Perhaps we should just refuse to do this > (realizing that we can't avoid a device entering queueing mode while > we're processing PR commands, but that's an even more extreme corner > case). libmpathpersist runs the path checkers before it sends any commands and doesn't send PRIN or PROUT commands to paths that are down. If there are no usable paths, libmpathpersist will just return failure. I have no plans to change any of this. Sure, it's unlikely that we will enter the queueing state later, while processing the libmpathpersist command. But the code doesn't get any simpler by doing a flushing suspend instead of a noflush suspend, and we make sure that we won't fail IOs when multipath is configured not to, even in this corner case. There are plenty of places in the multipath code where we suspend (or reload a table, where we implicitly suspend) knowing there are usable devices, and we do noflush suspends anyways. I don't see why this is any different. And the window where we lose our last path while mpathpersist is running is small but not super-small, especially in the case where we were told to PREEMPT our key. libmpathpersist runs the path checkers pretty early. There is still a bunch of setup and checking stuff it needs to do before it actually gets around to suspending in preparation for a device preempting its own key. So if the last path failed between when the path checkers were run and when we suspend, we could have queued IO. Users running a PREEMPT or a RELEASE command shouldn't expect it to cause queued IO to fail if all the paths go down, and I don't see any big benefit to offset the downside of allowing that possibility. > > If we can assume that the device is not queueing, it might actually be > better leave the NOFLUSH flag clear, making sure that the queue is > frozen and all IO is actually flushed before changing PR keys or > reservations. So, I'm assuming that you are talking about the lock_fs() call in __dm_suspend() that gets skipped when we do a noflush suspend. This is the only difference I can see that isn't just us opening ourselves up to unwanted IO errors. And this admittedly will cause the fs to flush IO to the device, which would be a good thing if we were about to lose our ability to write to the disk. I have two arguments against caring about this. 1. sg_persist doesn't force the filesystem to quiesce, so why should mpathpersist? Nothing that is doing persistent reservations should rely on that happening when they change a reservation. 2. We won't be losing our ability to write to the disk. Like I said, there are only two cases where we try to preempt ourselves and trigger this suspend: either we were explicitly told to preempt ourselves, or we failed doing a release because the path holding the reservation was down. In both cases, we keep our registered keys. The only Persistent reservation types where a path with a registered key can't do any IO it wants are types 1 (write exclusive) and 3 (exclusive access), and we should probably disallow those types in libmpathpersist, because they will never work. With those types of reservations, only the I_T Nexus holding the reservation is allowed to do IO. This means that only one path of a multipath device will ever be able to do IO. The others will fail with RESEVATION CONFLICT errors. -Ben > Am I getting something wrong here? > > Martin