Ben and Martin,

Sorry for the delayed response. I sent this to you both initially instead of
to the list because I thought you might have some insight into the issue.
I have an 'alua' checker that I will post where I found this issue. I
didn't actually
find it in the 'tur' checker but I could see it is prone to the same
issue. I have a
test I created to show the issue called test_tur.c. I will send this
too if it helps.
The problem when I didn't use this patch was that my checker the entire
checker thread would stall. It didn't just correct itself. Until other
paths failed
the checker thread was deadlocked somehow.

In the 'tur' checker without this change, the 'running != 0' will lead
to PATH_PENDING
and MSG_TUR_RUNNING without looking at what those actual values are.
It might actually
be finished, but running hasn't changed to 0 yet, and ct->thread is not
0 on returning. I am not sure what extra conditions were true when my checker
thread deadlocks, but with this change the deadlock never happens.

Thanks,
Brian

On Fri, Feb 27, 2026 at 12:36 AM Martin Wilck <[email protected]> wrote:
>
> On Thu, 2026-02-26 at 13:48 -0500, Benjamin Marzinski wrote:
> > On Wed, Feb 25, 2026 at 11:00:44PM +0100, Martin Wilck wrote:
> > >
> > > I'm still undecided if it's worth changing this code in order to
> > > reinstate some paths 1s earlier in certain sitautions.
> > >
> > > Thoughts?
> >
> > I don't see any problems with this. But I also didn't see any real
> > problems with Brian's version (although this version does remove most
> > of
> > my theoretic issues. Like I said before. If we hang in the condlog,
> > we've got bigger problems than a stray checker thread). If we do
> > this,
> > we should probably document the uatomic_add_return() call,
>
> Of course, this was not an official patch submission, just an idea how
> we might improve this code. It's not exactly the same thing; Brian's
> patch meant simply ignoring ct->running, while mine would make the ct-
> >running check reliable, albeit with changed semantics.
>
> >  or just use
> > cmm_smp_mb() directly. But I still not sure what we get out of this,
> > other than a very slight chance to finish the checker a second
> > sooner.
>
> Yes. But it's also a fact that the simple atomic_read() test in the
> current code isn't reliable. It can well return true when ct->running
> has already been set to false. I guess the implementation tends to be
> as light-weight as possible, avoiding a memory barrier (or taking a
> lock with the side effect of a memory barrier) in the "fast path". That
> comes with the price that we may miss an already finished checker
> thread.
>
> > don't think that really makes a difference, but I don't really see
> > much
> > risk in the patch either, so I guess I'm fine either way.
>
> If the risk was truly zero, I would agree. But experience tells that
> it's very easy to get these things wrong for corner cases, and I tend
> to want to avoid even a minimal risk just for "very slight chance to
> finish the checker a second sooner", as you've expressed it very
> nicely.
>
> @Brian, feel free to try to convince us otherwise. It might be helpful
> if we understood your use case better.
>
> Martin



-- 
Brian Bunker
PURE Storage, Inc.
[email protected]

Reply via email to