On Fri, 2025-03-28 at 12:36 -0400, Benjamin Marzinski wrote: > On Tue, Mar 25, 2025 at 11:33:18PM +0100, Martin Wilck wrote: > > On Mon, 2025-03-24 at 16:55 -0400, Benjamin Marzinski wrote: > > > When a new device is added by the multipath command, multipathd > > > may > > > know > > > of other paths that cannot be added to the device because they > > > are > > > currently offline. Instead of ignoring these paths, multipathd > > > will > > > now > > > re-add them when they come back online. To do this, it multipathd > > > needs > > > a new device initialized state, INIT_OFFLINE, to track devices > > > that > > > were > > > in INIT_OK, but could not be added to an existing multipath > > > device > > > because they were offline. These paths are handled along with the > > > other > > > uninitialized paths. > > > > > > Signed-off-by: Benjamin Marzinski <bmarz...@redhat.com> > > > > This looks good in general, but I'm not certain about INIT_OFFLINE. > > The > > property "path was temporarily offline while we tried to add it to > > a > > map" is not logically related to path initialization, IMO. Also, we > > have plenty of (pp->initialized != INIT_xyz) conditionals in the > > code, > > and your patch touches only 2 of them. I find it difficult to rule > > out > > that some of them might be broken by the additional init state. > > Have > > you double-checked them all? > > Yeah. I may have overlooked something, but I did look through them. I > just double-checked pathinfo(), since if that's called with DI_ALL > and > succeeds, it will unconditionally set the path to INIT_OK. There > isn't > a case where that could happen to a path in INIT_OFFLINE, but it > would > make sense to add a check there, just to make it obvious that this > shouldn't move the path from INIT_OFFLINE to INIT_OK. Even if > DI_CHECKER > is set and pathinfo() discovers that the path has come back up, It's > easier to just wait for checkerloop to check the path, and handle > adding > it the the multipath device there. > > I should point out what this patch doesn't do, since that might clear > up > some cases where you would think I need to handle offline paths. This > patch is only taking care of the case where a multipath device > exists, > but multipathd won't start monitoring it because it can't reload the > table with the offline paths. It does not handle the case where > offline > paths exist, and that keeps multipathd from creating a device in the > first place. > > The easiest way for that to happen is if the device WWID is not in > the > wwids file and find_multipaths is set to "on" or "smart", where a > device > will get multipathed when a second path appears. In this case, you > could > add a single path and fully initialize it, then have that path go > offline, and later add a second path. Multipathd would see that there > are two paths, and try to create a multipath device. However the > offline path would keep the device from being created. Since we > call pathinfo(DI_CHECKER) in adopt_paths(), multipathd should know > that the path is offline before it tries to create the device, so > we could create the device without any offline paths, and then set > those > paths to INIT_OFFLINE, for later addition. > > This case seemed less important to me than the one where we end up > with > an untracked mlutipath device. This problem has always existed and > it's > immediately obvious when it occurs (there's no multipath device), and > yet I've never gotten a bug report for it. This makes sense because > it's > pretty hard to hit. Usually, the path will either not get initialized > in > the first place because it's offline (in which case we won't attempt > to > use it as part of a multipath device) or multipathd will create a > multipath device using it as soon as the path gets added, leaving a > very > small window between when multipathd initializes the path and when > it creates the multipath device. The larger window for the path to > go offline only happens in the case where multipathd doesn't know > to create a multipath device right away. > > The case with the untracked device is pretty much as hard to hit, and > has also always existed, but it's not obvious to the user that their > multipath device isn't being tracked by multipathd. That seems much > worse to me. I'm considering fixing the case where the device doesn't > get created at all, but that can happen in multiple places, so I > think > it'll be a little messier. Also, I can't decide if multipathd should > try > to create the device with all the paths first, and only ignore the > offline paths on retries, if the first create attempt fails, or if it > should always skip the offline paths, at least if if just checked > their > state in adopt_paths(). Thoughts?
I agree that this is a corner case, and I have also never seen a bug report about it. I wouldn't want to spend a lot of work on it. Arguably the inconsistency is in the kernel, as the kernel doesn't mind carrying around an offline device in a map after it has been added, but refuses to add it in the first place while offline. But I'm not keen on changing anything in that area, either. Martin