On Tue, Mar 25, 2025 at 11:33:18PM +0100, Martin Wilck wrote:
> On Mon, 2025-03-24 at 16:55 -0400, Benjamin Marzinski wrote:
> > When a new device is added by the multipath command, multipathd may
> > know
> > of other paths that cannot be added to the device because they are
> > currently offline. Instead of ignoring these paths, multipathd will
> > now
> > re-add them when they come back online. To do this, it multipathd
> > needs
> > a new device initialized state, INIT_OFFLINE, to track devices that
> > were
> > in INIT_OK, but could not be added to an existing multipath device
> > because they were offline. These paths are handled along with the
> > other
> > uninitialized paths.
> > 
> > Signed-off-by: Benjamin Marzinski <bmarz...@redhat.com>
> 
> This looks good in general, but I'm not certain about INIT_OFFLINE. The
> property "path was temporarily offline while we tried to add it to a
> map" is not logically related to path initialization, IMO. Also, we
> have plenty of (pp->initialized != INIT_xyz) conditionals in the code,
> and your patch touches only 2 of them. I find it difficult to rule out
> that some of them might be broken by the additional init state. Have
> you double-checked them all?

Yeah. I may have overlooked something, but I did look through them. I
just double-checked pathinfo(), since if that's called with DI_ALL and
succeeds, it will unconditionally set the path to INIT_OK.  There isn't
a case where that could happen to a path in INIT_OFFLINE, but it would
make sense to add a check there, just to make it obvious that this
shouldn't move the path from INIT_OFFLINE to INIT_OK. Even if DI_CHECKER
is set and pathinfo() discovers that the path has come back up, It's
easier to just wait for checkerloop to check the path, and handle adding
it the the multipath device there.

I should point out what this patch doesn't do, since that might clear up
some cases where you would think I need to handle offline paths. This
patch is only taking care of the case where a multipath device exists,
but multipathd won't start monitoring it because it can't reload the
table with the offline paths. It does not handle the case where offline
paths exist, and that keeps multipathd from creating a device in the
first place.

The easiest way for that to happen is if the device WWID is not in the
wwids file and find_multipaths is set to "on" or "smart", where a device
will get multipathed when a second path appears. In this case, you could
add a single path and fully initialize it, then have that path go
offline, and later add a second path. Multipathd would see that there
are two paths, and try to create a multipath device. However the
offline path would keep the device from being created. Since we
call pathinfo(DI_CHECKER) in adopt_paths(), multipathd should know
that the path is offline before it tries to create the device, so
we could create the device without any offline paths, and then set those
paths to INIT_OFFLINE, for later addition.

This case seemed less important to me than the one where we end up with
an untracked mlutipath device. This problem has always existed and it's
immediately obvious when it occurs (there's no multipath device), and
yet I've never gotten a bug report for it. This makes sense because it's
pretty hard to hit. Usually, the path will either not get initialized in
the first place because it's offline (in which case we won't attempt to
use it as part of a multipath device) or multipathd will create a
multipath device using it as soon as the path gets added, leaving a very
small window between when multipathd initializes the path and when
it creates the multipath device. The larger window for the path to
go offline only happens in the case where multipathd doesn't know
to create a multipath device right away.

The case with the untracked device is pretty much as hard to hit, and
has also always existed, but it's not obvious to the user that their
multipath device isn't being tracked by multipathd. That seems much
worse to me. I'm considering fixing the case where the device doesn't
get created at all, but that can happen in multiple places, so I think
it'll be a little messier. Also, I can't decide if multipathd should try
to create the device with all the paths first, and only ignore the
offline paths on retries, if the first create attempt fails, or if it
should always skip the offline paths, at least if if just checked their
state in adopt_paths(). Thoughts?

> 
> Maybe it would be better to store this property in a separate flag in
> struct path?

Sure. I can switch to that.

> 2 more remarks below.
> 
> Thanks
> Martin
> 
> > ---
> >  libmultipath/print.c       |  1 +
> >  libmultipath/structs.h     |  5 ++++
> >  libmultipath/structs_vec.c |  5 ++++
> >  multipathd/main.c          | 58
> > ++++++++++++++++++++++++++++++++++++--
> >  4 files changed, 67 insertions(+), 2 deletions(-)
> > 
> > diff --git a/libmultipath/print.c b/libmultipath/print.c
> > index 00c03ace..ed8adebe 100644
> > --- a/libmultipath/print.c
> > +++ b/libmultipath/print.c
> > @@ -572,6 +572,7 @@ static int snprint_initialized(struct strbuf
> > *buff, const struct path * pp)
> >             [INIT_OK] = "ok",
> >             [INIT_REMOVED] = "removed",
> >             [INIT_PARTIAL] = "partial",
> > +           [INIT_OFFLINE] = "offline",
> >     };
> >     const char *str;
> >  
> > diff --git a/libmultipath/structs.h b/libmultipath/structs.h
> > index 28de9a7f..8644407f 100644
> > --- a/libmultipath/structs.h
> > +++ b/libmultipath/structs.h
> > @@ -258,6 +258,11 @@ enum initialized_states {
> >      * change uevent is received.
> >      */
> >     INIT_PARTIAL,
> > +   /*
> > +    * INIT_OFFLINE: paths that should be part of an existing
> > multipath
> > +    * device, but cannot be added because they are offline,
> > +    */
> > +   INIT_OFFLINE,
> >     INIT_LAST__,
> >  };
> >  
> > diff --git a/libmultipath/structs_vec.c b/libmultipath/structs_vec.c
> > index f6407e12..f122d056 100644
> > --- a/libmultipath/structs_vec.c
> > +++ b/libmultipath/structs_vec.c
> > @@ -389,6 +389,9 @@ static void orphan_paths(vector pathvec, struct
> > multipath *mpp, const char *reas
> >                             free_path(pp);
> >                     } else
> >                             orphan_path(pp, reason);
> > +           } else if (pp->initialized == INIT_OFFLINE &&
> > +                      strncmp(mpp->wwid, pp->wwid, WWID_SIZE)
> > == 0) {
> > +                   pp->initialized = INIT_OK;
> >             }
> >     }
> >  }
> > @@ -595,6 +598,8 @@ void sync_paths(struct multipath *mpp, vector
> > pathvec)
> >             found = 0;
> >             vector_foreach_slot(mpp->pg, pgp, j) {
> >                     if (find_slot(pgp->paths, (void *)pp) != -1)
> > {
> > +                           if (pp->initialized == INIT_OFFLINE)
> > +                                   pp->initialized = INIT_OK;
> >                             found = 1;
> >                             break;
> >                     }
> > diff --git a/multipathd/main.c b/multipathd/main.c
> > index 7aaae773..ecad5a4f 100644
> > --- a/multipathd/main.c
> > +++ b/multipathd/main.c
> > @@ -644,11 +644,44 @@ pr_register_active_paths(struct multipath *mpp)
> >     }
> >  }
> >  
> > +static void
> > +save_offline_paths(struct multipath *mpp, vector offline_paths)
> 
> const struct multipath *mpp ?

Sure.

> > +{
> > +   unsigned int i, j;
> > +   struct path *pp;
> > +   struct pathgroup *pgp;
> > +
> > +   vector_foreach_slot (mpp->pg, pgp, i)
> > +           vector_foreach_slot (pgp->paths, pp, j)
> > +                   if (pp->initialized == INIT_OK &&
> > +                       pp->sysfs_state == PATH_DOWN)
> > +                           store_path(offline_paths, pp);
> 
> You're not handling error from store_path here. I don't think you need
> to, but perhaps indicate this by adding a comment or calling
> (void)store_path(...).

Yeah. There's no real point in failing here if storing the path fails.
I can comment this.

> > +}
> > +
> > +static void
> > +handle_orphaned_offline_paths(vector offline_paths)
> > +{
> > +   unsigned int i;
> > +   struct path *pp;
> > +
> > +   vector_foreach_slot (offline_paths, pp, i)
> > +           if (pp->mpp == NULL)
> > +                   pp->initialized = INIT_OFFLINE;
> > +}
> > +
> > +static void
> > +cleanup_reset_vec(struct vector_s **v)
> > +{
> > +   vector_reset(*v);
> > +}
> > +
> >  static int
> >  update_map (struct multipath *mpp, struct vectors *vecs, int
> > new_map)
> >  {
> >     int retries = 3;
> >     char *params __attribute__((cleanup(cleanup_charp))) = NULL;
> > +   struct vector_s offline_paths_vec = { .allocated = 0 };
> > +   vector offline_paths
> > __attribute__((cleanup(cleanup_reset_vec))) = &offline_paths_vec;
> >  
> >  retry:
> >     condlog(4, "%s: updating new map", mpp->alias);
> > @@ -685,6 +718,9 @@ fail:
> >             return 1;
> >     }
> >  
> > +   if (new_map && retries < 0)
> > +           save_offline_paths(mpp, offline_paths);
> > +
> >     if (setup_multipath(vecs, mpp))
> >             return 1;
> >  
> > @@ -695,6 +731,9 @@ fail:
> >     if (mpp->prflag == PRFLAG_SET)
> >             pr_register_active_paths(mpp);
> >  
> > +   if (VECTOR_SIZE(offline_paths) != 0)
> > +           handle_orphaned_offline_paths(offline_paths);
> > +
> >     if (retries < 0)
> >             condlog(0, "%s: failed reload in new map update",
> > mpp->alias);
> >     return 0;
> > @@ -2793,7 +2832,8 @@ check_uninitialized_path(struct path * pp,
> > unsigned int ticks)
> >     struct config *conf;
> >  
> >     if (pp->initialized != INIT_NEW && pp->initialized !=
> > INIT_FAILED &&
> > -       pp->initialized != INIT_MISSING_UDEV)
> > +       pp->initialized != INIT_MISSING_UDEV &&
> > +       pp->initialized != INIT_OFFLINE)
> >             return CHECK_PATH_SKIPPED;
> >  
> >     if (pp->tick)
> > @@ -2849,7 +2889,8 @@ update_uninitialized_path(struct vectors *
> > vecs, struct path * pp)
> >     struct config *conf;
> >  
> >     if (pp->initialized != INIT_NEW && pp->initialized !=
> > INIT_FAILED &&
> > -       pp->initialized != INIT_MISSING_UDEV)
> > +       pp->initialized != INIT_MISSING_UDEV &&
> > +       pp->initialized != INIT_OFFLINE)
> >             return CHECK_PATH_SKIPPED;
> >  
> >     newstate = get_new_state(pp);
> > @@ -2875,6 +2916,19 @@ update_uninitialized_path(struct vectors *
> > vecs, struct path * pp)
> >                     free_path(pp);
> >                     return CHECK_PATH_REMOVED;
> >             }
> > +   } else if (pp->initialized == INIT_OFFLINE &&
> > +              (newstate == PATH_UP || newstate == PATH_GHOST))
> > {
> > +           pp->initialized = INIT_OK;
> > +           if (pp->recheck_wwid == RECHECK_WWID_ON &&
> > +               check_path_wwid_change(pp)) {
> > +                   condlog(0, "%s: path wwid change detected.
> > Removing",
> > +                           pp->dev);
> > +                   return handle_path_wwid_change(pp, vecs)?
> > +                                   CHECK_PATH_REMOVED :
> > +                                   CHECK_PATH_SKIPPED;
> > +           }
> > +           ev_add_path(pp, vecs, 1);
> > +           pp->tick = 1;
> >     }
> >     return CHECK_PATH_CHECKED;
> >  }


Reply via email to