On 6/8/26 18:16, Boris Brezillon wrote: > On Mon, 08 Jun 2026 17:30:58 +0200 > Philipp Stanner <[email protected]> wrote: > >> On Mon, 2026-06-08 at 17:23 +0200, Danilo Krummrich wrote: >>> On Mon Jun 8, 2026 at 5:17 PM CEST, Philipp Stanner wrote: >>>> On Mon, 2026-06-08 at 17:01 +0200, Boris Brezillon wrote: >>>>> On Mon, 8 Jun 2026 16:24:37 +0200 >>>>> Philipp Stanner <[email protected]> wrote: >>>>> >>>>>> @@ -1020,11 +1024,20 @@ EXPORT_SYMBOL(dma_fence_wait_any_timeout); >>>>>> void dma_fence_set_deadline(struct dma_fence *fence, ktime_t deadline) >>>>>> { >>>>>> const struct dma_fence_ops *ops; >>>>>> + unsigned long flags; >>>>>> >>>>>> rcu_read_lock(); >>>>>> ops = rcu_dereference(fence->ops); >>>>>> - if (ops && ops->set_deadline && !dma_fence_is_signaled(fence)) >>>>>> + if (!ops || !ops->set_deadline) { >>>>>> + rcu_read_unlock(); >>>>>> + return; >>>>>> + } >>>>>> + >>>>>> + dma_fence_lock_irqsave(fence, flags); >>>>>> + if (!dma_fence_is_signaled_locked(fence)) >>>>>> ops->set_deadline(fence, deadline); >>>>> >>>>> You can't take the fence lock around ->set_deadline(), otherwise you'll >>>>> deadlock here [1] or here [2]. >>>>> >>>>>> + >>>>>> + dma_fence_unlock_irqrestore(fence, flags); >>>>>> rcu_read_unlock(); >>>>>> } >>>>> >>>>> >>>>> [1]https://elixir.bootlin.com/linux/v7.0.11/source/drivers/dma-buf/sw_sync.c#L182 >>>>> [2]https://elixir.bootlin.com/linux/v7.0.11/source/drivers/gpu/drm/msm/msm_fence.c#L139 >>>>> >> >> Oh, MSM actually doesn't btw, that's a false positive. That's a >> distinct spinlock on their fence context object. > > It's not, it's the same lock they attach to all their fences coming > from this context. It's just that this lock appears to be per-context, > like is the case for basically every driver, since the inline lock was > introduced only in this release cycle. > > Anyway, this is all stuff we can fix if people think it's okay to > protect dma_fence_ops calls with the fence lock. But my point remains: > each op has its own locking-rules, some are called with the fence lock > held (enable_signaling(), signaled()), others are not (set_deadline(), > get_xxx_name()), so we need to carefully audit each of those to make > sure: > > - calling with the lock held in the new places is not causing a > deadlock
Yeah, exactly that. For example the set_deadline() is intentionally not called with the fence lock held because that won't work for some use cases. The problem when you call ops with a lock held is always that this lock then becomes the outermost look held. In other words when you for example want to grab a power management lock to implement the deadline feature the framework enforces an order between the two locks which isn't desired. Regards, Christian. > - the returned data, if not a scalar, is protected by the RCU read lock > - any driver implementing ops that can be called without the lock held > need to hold on the device data for an RCU grace period > > The last bullet is probably the one I'm the most worried about, because > instead of a single rule that applies to all ops, we have various cases > based on whether some ops are implemented or not, but that's already > the case with deprecated ops like .release() or .wait(), so maybe > that's okay with the proper doc. > > If I were to choose, I'd probably go for a dedicated rwlock_t to > protect dma_fence_ops, so we can: > > - protect all dma_buf_ops::xx() consistently no matter the kind of op > - protect returned data (get_xxx_name()) with this lock instead of the > RCU read lock > > But the overhead of this extra lock might not be acceptable, dunno. > >> >> >> But yes, before we could upstream this, we would go through all the >> implementors like Danilo did, to find all the others. > > There's the two I pointed out, plus the array/chain containers I > mentioned, which are not problematic.
