On Fri, 2025-03-28 at 12:36 -0400, Benjamin Marzinski wrote:
> On Tue, Mar 25, 2025 at 11:33:18PM +0100, Martin Wilck wrote:
> > On Mon, 2025-03-24 at 16:55 -0400, Benjamin Marzinski wrote:
> > > When a new device is added by the multipath command, multipathd
> > > may
> > > know
> > > of other paths that cannot be added to the device because they
> > > are
> > > currently offline. Instead of ignoring these paths, multipathd
> > > will
> > > now
> > > re-add them when they come back online. To do this, it multipathd
> > > needs
> > > a new device initialized state, INIT_OFFLINE, to track devices
> > > that
> > > were
> > > in INIT_OK, but could not be added to an existing multipath
> > > device
> > > because they were offline. These paths are handled along with the
> > > other
> > > uninitialized paths.
> > > 
> > > Signed-off-by: Benjamin Marzinski <bmarz...@redhat.com>
> > 
> > This looks good in general, but I'm not certain about INIT_OFFLINE.
> > The
> > property "path was temporarily offline while we tried to add it to
> > a
> > map" is not logically related to path initialization, IMO. Also, we
> > have plenty of (pp->initialized != INIT_xyz) conditionals in the
> > code,
> > and your patch touches only 2 of them. I find it difficult to rule
> > out
> > that some of them might be broken by the additional init state.
> > Have
> > you double-checked them all?
> 
> Yeah. I may have overlooked something, but I did look through them. I
> just double-checked pathinfo(), since if that's called with DI_ALL
> and
> succeeds, it will unconditionally set the path to INIT_OK.  There
> isn't
> a case where that could happen to a path in INIT_OFFLINE, but it
> would
> make sense to add a check there, just to make it obvious that this
> shouldn't move the path from INIT_OFFLINE to INIT_OK. Even if
> DI_CHECKER
> is set and pathinfo() discovers that the path has come back up, It's
> easier to just wait for checkerloop to check the path, and handle
> adding
> it the the multipath device there.
> 
> I should point out what this patch doesn't do, since that might clear
> up
> some cases where you would think I need to handle offline paths. This
> patch is only taking care of the case where a multipath device
> exists,
> but multipathd won't start monitoring it because it can't reload the
> table with the offline paths. It does not handle the case where
> offline
> paths exist, and that keeps multipathd from creating a device in the
> first place.
> 
> The easiest way for that to happen is if the device WWID is not in
> the
> wwids file and find_multipaths is set to "on" or "smart", where a
> device
> will get multipathed when a second path appears. In this case, you
> could
> add a single path and fully initialize it, then have that path go
> offline, and later add a second path. Multipathd would see that there
> are two paths, and try to create a multipath device. However the
> offline path would keep the device from being created. Since we
> call pathinfo(DI_CHECKER) in adopt_paths(), multipathd should know
> that the path is offline before it tries to create the device, so
> we could create the device without any offline paths, and then set
> those
> paths to INIT_OFFLINE, for later addition.
> 
> This case seemed less important to me than the one where we end up
> with
> an untracked mlutipath device. This problem has always existed and
> it's
> immediately obvious when it occurs (there's no multipath device), and
> yet I've never gotten a bug report for it. This makes sense because
> it's
> pretty hard to hit. Usually, the path will either not get initialized
> in
> the first place because it's offline (in which case we won't attempt
> to
> use it as part of a multipath device) or multipathd will create a
> multipath device using it as soon as the path gets added, leaving a
> very
> small window between when multipathd initializes the path and when
> it creates the multipath device. The larger window for the path to
> go offline only happens in the case where multipathd doesn't know
> to create a multipath device right away.
> 
> The case with the untracked device is pretty much as hard to hit, and
> has also always existed, but it's not obvious to the user that their
> multipath device isn't being tracked by multipathd. That seems much
> worse to me. I'm considering fixing the case where the device doesn't
> get created at all, but that can happen in multiple places, so I
> think
> it'll be a little messier. Also, I can't decide if multipathd should
> try
> to create the device with all the paths first, and only ignore the
> offline paths on retries, if the first create attempt fails, or if it
> should always skip the offline paths, at least if if just checked
> their
> state in adopt_paths(). Thoughts?

I agree that this is a corner case, and I have also never seen a bug
report about it. I wouldn't want to spend a lot of work on it.

Arguably the inconsistency is in the kernel, as the kernel doesn't mind
carrying around an offline device in a map after it has been added, but
refuses to add it in the first place while offline.

But I'm not keen on changing anything in that area, either.

Martin


Reply via email to