On Wed, 2026-02-25 at 15:41 -0500, Benjamin Marzinski wrote:
>
> I pretty much agree with Martin here. I do think this is likely safe,
> but I don't think this is really a race. We don't return the status
> of the thread until the thread completes, and the thread signals its
> completion by clearing ct->running. That's the same definition of
> successful thread completion that we use for deciding to kill the
> thread, and I don't see why it's not a valid one.
>
> Yes we can get the thread result slightly before the thread has
> completed. But by trusting that the thread will complete once
> pp->msgid has changed, we are giving up the ability to kill it if it
> does hang later, for instance in that condlog() call. On systemd
> systems, this is just some stdio print functions, but on non-systemd
> systems, this uses its own pthread locking, and could legitimately
> hang (although the argument certainly can be made that if this is
> hanging, multipathd has bigger problems to worry about than an extra
> tur checker thread lying around).
After some further pondering, I wonder if we could do this:
--- a/libmultipath/checkers/tur.c
+++ b/libmultipath/checkers/tur.c
@@ -267,6 +267,7 @@ void *libcheck_thread(struct checker_context *ctx)
 	struct tur_checker_context *ct =
 		container_of(ctx, struct tur_checker_context, ctx);
 	int state, running;
+	dev_t devt;
 	short msgid;
 
 	/* This thread can be canceled, so setup clean up */
@@ -284,17 +285,17 @@ void *libcheck_thread(struct checker_context *ctx)
 	ct->state = state;
 	ct->msgid = msgid;
 	pthread_cond_signal(&ct->active);
+	running = uatomic_xchg(&ct->running, 0);
 	pthread_mutex_unlock(&ct->lock);
-	condlog(4, "%d:%d : tur checker finished, state %s", major(ct->devt),
-		minor(ct->devt), checker_state_name(state));
-
-	running = uatomic_xchg(&ct->running, 0);
+	devt = ct->devt;
 	if (!running)
 		pause();
 	tur_thread_cleanup_pop(ct);
+	condlog(4, "%d:%d : tur checker finished, state %s", major(devt),
+		minor(devt), checker_state_name(state));
 	return ((void *)0);
 }
@@ -356,7 +357,7 @@ int libcheck_check(struct checker * c)
 			pthread_mutex_unlock(&ct->lock);
 		}
 		ct->thread = 0;
-	} else if (uatomic_read(&ct->running) != 0) {
+	} else if (uatomic_add_return(&ct->running, 0) != 0) {
 		condlog(3, "%d:%d : tur checker not finished",
 			major(ct->devt), minor(ct->devt));
 		tur_status = PATH_PENDING;
@@ -356,7 +357,7 @@ int libcheck_check(struct checker * c)
pthread_mutex_unlock(&ct->lock);
}
ct->thread = 0;
- } else if (uatomic_read(&ct->running) != 0) {
+ } else if (uatomic_add_return(&ct->running, 0) != 0) {
condlog(3, "%d:%d : tur checker not finished",
major(ct->devt), minor(ct->devt));
tur_status = PATH_PENDING;
This way the thread would "lie" about itself being not running any more
by setting ct->running to 0 before releasing the lock. AFAICS, in the
worst case, this can cause the main thread not to cancel it even though
it has timed out. But at this point it has already passed the point at
which it's most likely to hang. The only point at which it could still
block is the condlog(). By releasing ct before calling condlog() we can
make sure that even if the thread now blocks, it can't prevent freeing
of ct any more. We can basically forget about it at this point.
Replacing the uatomic_read() by a uatomic_add_return(..., 0) is a hack
that adds a memory barrier, which, together with the mutex in
libcheck_thread(), should make sure that the values of ct->running and
ct->msgid are consistent at this point in libcheck_check().
Together, these changes should fix Brian's issue.
The added memory barrier might make libcheck_check() a bit slower. I'm
not sure whether the difference would be noticeable. I'm also not sure
why we didn't use a barrier there to begin with; perhaps to keep this
check as lightweight as possible. The uatomic_read() was added in
commit c3a839c ("libmultipath: cleanup tur locking").
I'm still undecided whether it's worth changing this code in order to
reinstate some paths 1s earlier in certain situations.
Thoughts?
Martin