On Wed, Feb 25, 2026 at 11:00:44PM +0100, Martin Wilck wrote:
> On Wed, 2026-02-25 at 15:41 -0500, Benjamin Marzinski wrote:
> > 
> > I pretty much agree with Martin here. I do think this is likely safe,
> > but I don't think this is really a race. We don't return the status
> > of
> > the thread until the thread completes, and the thread signals its
> > completion by clearing ct->running. That's the same definition of
> > successful thread completion that we use for deciding to kill the
> > thread, and I don't see why it's not a valid one.
> > 
> > Yes we can get the thread result slightly before the thread has
> > completed. But by trusting that the thread will complete once
> > pp->msgid has changed, we are giving up the ability to kill it if it
> > does hang later, for instance in that condlog() call. On systemd
> > systems, this is just some stdio print functions, but on non-systemd
> > systems, this uses its own pthread locking, and could legitimately
> > hang (although the argument certainly can be made that if this is
> > hanging, multipathd has bigger problems to worry about than an extra
> > tur checker thread lying around).
> 
> After some further pondering, I wonder if we could do this:
> 
> --- a/libmultipath/checkers/tur.c
> +++ b/libmultipath/checkers/tur.c
> @@ -267,6 +267,7 @@ void *libcheck_thread(struct checker_context *ctx)
>         struct tur_checker_context *ct =
>                 container_of(ctx, struct tur_checker_context, ctx);
>         int state, running;
> +       dev_t devt;
>         short msgid;
>  
>         /* This thread can be canceled, so setup clean up */
> @@ -284,17 +285,17 @@ void *libcheck_thread(struct checker_context *ctx)
>         ct->state = state;
>         ct->msgid = msgid;
>         pthread_cond_signal(&ct->active);
> +       running = uatomic_xchg(&ct->running, 0);
>         pthread_mutex_unlock(&ct->lock);
>  
> -       condlog(4, "%d:%d : tur checker finished, state %s", major(ct->devt),
> -               minor(ct->devt), checker_state_name(state));
> -
> -       running = uatomic_xchg(&ct->running, 0);
> +       devt = ct->devt;
>         if (!running)
>                 pause();
>  
>         tur_thread_cleanup_pop(ct);
>  
> +       condlog(4, "%d:%d : tur checker finished, state %s", major(devt),
> +               minor(devt), checker_state_name(state));
>         return ((void *)0);
>  }
>  
> @@ -356,7 +357,7 @@ int libcheck_check(struct checker * c)
>                                 pthread_mutex_unlock(&ct->lock);
>                         }
>                         ct->thread = 0;
> -               } else if (uatomic_read(&ct->running) != 0) {
> +               } else if (uatomic_add_return(&ct->running, 0) != 0) {
>                         condlog(3, "%d:%d : tur checker not finished",
>                                 major(ct->devt), minor(ct->devt));
>                         tur_status = PATH_PENDING;
> 
> 
> This way the thread would "lie" about itself being not running any more
> by setting ct->running to 0 before releasing the lock. AFAICS, in the
> worst case, this can cause the main thread not to cancel it even though
> it has timed out. But at this point it has already passed the point at
> which it's most likely to hang. The only point at which it could still
> block is the condlog(). By releasing ct before calling condlog() we can
> make sure that even if the thread now blocks, it can't prevent freeing
> of ct any more. We can basically forget about it at this point.
> 
> Replacing the uatomic_read() by a uatomic_add_return(..., 0) is a hack
> that adds a memory barrier, which, together with the mutex in
> libcheck_thread(), should make sure that the values of ct->running and
> ct->msgid are consistent at this point in libcheck_check().
> 
> Together, these changes should fix Brian's issue.
> 
> The added memory barrier might make libcheck_check() a bit slower. I'm
> not sure whether the difference would be noticeable. I'm also not sure
> why we didn't use a barrier there to begin with; perhaps to keep this
> check as light-weight as possible. The uatomic_read() was added in
> c3a839c ("libmultipath: cleanup tur locking").
> 
> I'm still undecided whether it's worth changing this code in order to
> reinstate some paths 1s earlier in certain situations.
> 
> Thoughts?

I don't see any problems with this. But I also didn't see any real
problems with Brian's version (although this version does remove most
of my theoretical issues; like I said before, if we hang in the
condlog(), we've got bigger problems than a stray checker thread). If
we do this, we should probably document the uatomic_add_return() call,
or just use cmm_smp_mb() directly. But I'm still not sure what we get
out of this, other than a very slight chance of finishing the checker a
second sooner. I don't think that really makes a difference, but I
don't see much risk in the patch either, so I guess I'm fine either
way.

-Ben

> 
> Martin
