On Wed, Mar 04, 2026 at 09:38:50AM -0800, Brian Bunker wrote: > Ben and Martin, > > Sorry for the delayed response. I sent this to you both initially instead of > to the list because I thought you might have some insight into the issue. > I have an 'alua' checker that I will post where I found this issue. I > didn't actually > find it in the 'tur' checker but I could see it is prone to the same > issue. I have a > test I created to show the issue called test_tur.c. I will send this > too if it helps.
If you have a test that can trigger the issue you are seeing, that would be really helpful it seeing what's going on here. -Ben > The problem when I didn't use this patch was that my checker the entire > checker thread would stall. It didn't just correct itself. Until other > paths failed > the checker thread was deadlocked somehow. > > In the 'tur' checker without this change, the 'running != 0' will lead > to PATH_PENDING > and MSG_TUR_RUNNING without looking at what those actual values are. > It might actually > be finished, but running hasn't changed to 0 yet, and ct->thread is not > 0 on returning. I am not sure what extra conditions were true when my checker > thread deadlocks, but with this change the deadlock never happens. > > Thanks, > Brian > > On Fri, Feb 27, 2026 at 12:36 AM Martin Wilck <[email protected]> wrote: > > > > On Thu, 2026-02-26 at 13:48 -0500, Benjamin Marzinski wrote: > > > On Wed, Feb 25, 2026 at 11:00:44PM +0100, Martin Wilck wrote: > > > > > > > > I'm still undecided if it's worth changing this code in order to > > > > reinstate some paths 1s earlier in certain sitautions. > > > > > > > > Thoughts? > > > > > > I don't see any problems with this. But I also didn't see any real > > > problems with Brian's version (although this version does remove most > > > of > > > my theoretic issues. Like I said before. If we hang in the condlog, > > > we've got bigger problems than a stray checker thread). If we do > > > this, > > > we should probably document the uatomic_add_return() call, > > > > Of course, this was not an official patch submission, just an idea how > > we might improve this code. It's not exactly the same thing; Brian's > > patch meant simply ignoring ct->running, while mine would make the ct- > > >running check reliable, albeit with changed semantics. > > > > > or just use > > > cmm_smp_mb() directly. But I still not sure what we get out of this, > > > other than a very slight chance to finish the checker a second > > > sooner. > > > > Yes. But it's also a fact that the simple atomic_read() test in the > > current code isn't reliable. It can well return true when ct->running > > has already been set to false. I guess the implementation tends to be > > as light-weight as possible, avoiding a memory barrier (or taking a > > lock with the side effect of a memory barrier) in the "fast path". That > > comes with the price that we may miss an already finished checker > > thread. > > > > > don't think that really makes a difference, but I don't really see > > > much > > > risk in the patch either, so I guess I'm fine either way. > > > > If the risk was truly zero, I would agree. But experience tells that > > it's very easy to get these things wrong for corner cases, and I tend > > to want to avoid even a minimal risk just for "very slight chance to > > finish the checker a second sooner", as you've expressed it very > > nicely. > > > > @Brian, feel free to try to convince us otherwise. It might be helpful > > if we understood your use case better. > > > > Martin > > > > -- > Brian Bunker > PURE Storage, Inc. > [email protected]
