On 6/9/26 12:42, Philipp Stanner wrote: > On Tue, 2026-06-09 at 12:26 +0200, Christian König wrote: >> On 6/9/26 07:52, Philipp Stanner wrote: >> ... >>>> >>>> In detail calling the callbacks without holding locks allows all >>>> implementations who need it to explicitly take locks in the order they >>>> want. >>> >>> Didn't you say a few mails above that the implementation should not use >>> the fence lock for its own purposes? >> >> The usual use case the drivers have is this here: >> >> dma_fence_lock_irqsave(fence, flags); >> was_signaled = dma_fence_test_signaled(fence); >> if (!was_signaled) >> dma_fence_signal(fence); >> dma_fence_unlock_irqrestore(fence, flags); >> >> if (!was_signaled) > > If cleanup() touches fence data, this is now exactly the race that I am > concerned with. > > With the current design, you'd actually need a synchronize_rcu() here, > which you definitely do not want in a hot path :( > >> cleanup(); >> >> This is actually what you and me came up with for the KFD when we >> removed the return code for dma_fence_signal(). > > Well, what we came up with for that rare case was > > was_signaled = dma_fence_check_and_signal(fence); > if (!was_signaled) > cleanup(); > > It seems that many of the other use cases where the fence lock is taken > by drivers ultimately are rooted in the (IMO) design mistake that > dma_fence_set_error() exists. > > There should just be > > void > dma_fence_signal(struct dma_fence *f, int err); > > which also conveniently forces all signalers to really carefully think > about whether the fence's associated operations succeeded. Correctly > representing the "true fence error state" to all consumers is vital for > sane system behavior, as you continuously have to point out.
Yeah that came to my mind before as well. >> >> Taking the lock around the enable_signaling() callback has the exact >> same reason, preventing the fence from signaling between testing and >> calling dma_fence_signal(). The problem with that approach is that >> cleanup() now suddenly runs under the fence lock as well. > > That problem does not exist if we re-design dma_fence like that: > > // driver > dma_fence_signal(f); // revokes all accesses to our driver through backend_ops > // synchronize_rcu() now unnecessary \o/ > cleanup(f); // We know that all accessors are gone > dma_fence_put(f); Yeah and exactly that doesn't work. Just think about the Nouveau case when you have your fences on a double linked list. When the fence lock is independent, e.g. have a separate lock for each fence then this lock can't protect this double linked list. So your cleanup path needs to take a lock which protects the list, but you then run into lock inversion. >> >> So you are left with few options: Either the fence lock is external, >> which we don't want because that make the fence non-independent, or >> cleanup() defers work to irq_work or work_structs, which creates >> numerous lifetime issues. > > Yup, this is uncool and we want to avoid that. > > But these seem to be the options > > 1. Ensure proper synchronization > 2. Wait for a grace period in a hot path > 3. Defer cleanup() with some delay mechanism > > #1 is by far the cleanest approach. I still cannot see any downside, > and quite a few upsides. > > https://elixir.bootlin.com/linux/v7.1-rc6/source/drivers/dma-buf/dma-fence.c#L1025 > > ^ is already racing with the signaled check. Yeah so what? That is just an opportunistic check. > I'm going through the users right now, and it seems all we need to > ensure to implement this change is to audit all ops->signaled() for > usage of the fence_lock and see if we can port them. > >> >> When enable_signaling would be independent of the fence lock we could >> avoid all of this and just use a normal spin_lock() for the cleanup. > > How's that related to our problem? See the explanation above for the Nouveau example. As far as I can see the approach you want to implement here is a NO-GO from my side. Regards, Christian. > > P.
