On Mon, 08 Jun 2026 17:30:58 +0200 Philipp Stanner <[email protected]> wrote:
> On Mon, 2026-06-08 at 17:23 +0200, Danilo Krummrich wrote: > > On Mon Jun 8, 2026 at 5:17 PM CEST, Philipp Stanner wrote: > > > On Mon, 2026-06-08 at 17:01 +0200, Boris Brezillon wrote: > > > > On Mon, 8 Jun 2026 16:24:37 +0200 > > > > Philipp Stanner <[email protected]> wrote: > > > > > > > > > @@ -1020,11 +1024,20 @@ EXPORT_SYMBOL(dma_fence_wait_any_timeout); > > > > > void dma_fence_set_deadline(struct dma_fence *fence, ktime_t > > > > > deadline) > > > > > { > > > > > const struct dma_fence_ops *ops; > > > > > + unsigned long flags; > > > > > > > > > > rcu_read_lock(); > > > > > ops = rcu_dereference(fence->ops); > > > > > - if (ops && ops->set_deadline && !dma_fence_is_signaled(fence)) > > > > > + if (!ops || !ops->set_deadline) { > > > > > + rcu_read_unlock(); > > > > > + return; > > > > > + } > > > > > + > > > > > + dma_fence_lock_irqsave(fence, flags); > > > > > + if (!dma_fence_is_signaled_locked(fence)) > > > > > ops->set_deadline(fence, deadline); > > > > > > > > You can't take the fence lock around ->set_deadline(), otherwise you'll > > > > deadlock here [1] or here [2]. > > > > > > > > > + > > > > > + dma_fence_unlock_irqrestore(fence, flags); > > > > > rcu_read_unlock(); > > > > > } > > > > > > > > > > > > [1]https://elixir.bootlin.com/linux/v7.0.11/source/drivers/dma-buf/sw_sync.c#L182 > > > > [2]https://elixir.bootlin.com/linux/v7.0.11/source/drivers/gpu/drm/msm/msm_fence.c#L139 > > > > > > Oh, MSM actually doesn't btw, that's a false positive. That's a > distinct spinlock on their fence context object. It's not, it's the same lock they attach to all their fences coming from this context. It's just that this lock appears to be per-context, like is the case for basically every driver, since the inline lock was introduced only in this release cycle. Anyway, this is all stuff we can fix if people think it's okay to protect dma_fence_ops calls with the fence lock. But my point remains: each op has its own locking-rules, some are called with the fence lock held (enable_signaling(), signaled()), others are not (set_deadline(), get_xxx_name()), so we need to carefully audit each of those to make sure: - calling with the lock held in the new places is not causing a deadlock - the returned data, if not a scalar, is protected by the RCU read lock - any driver implementing ops that can be called without the lock held need to hold on the device data for an RCU grace period The last bullet is probably the one I'm the most worried about, because instead of a single rule that applies to all ops, we have various cases based on whether some ops are implemented or not, but that's already the case with deprecated ops like .release() or .wait(), so maybe that's okay with the proper doc. If I were to choose, I'd probably go for a dedicated rwlock_t to protect dma_fence_ops, so we can: - protect all dma_buf_ops::xx() consistently no matter the kind of op - protect returned data (get_xxx_name()) with this lock instead of the RCU read lock But the overhead of this extra lock might not be acceptable, dunno. > > > But yes, before we could upstream this, we would go through all the > implementors like Danilo did, to find all the others. There's the two I pointed out, plus the array/chain containers I mentioned, which are not problematic.
