On Wed, 2026-02-11 at 16:43 +0100, Christian König wrote:
> On 2/11/26 11:06, Philipp Stanner wrote:
> > On Tue, 2026-02-10 at 11:01 +0100, Christian König wrote:
> > > At first glance it is counter intuitive to protect a constant function
> > > pointer table by RCU, but this allows modules providing the function
> > > table to unload by waiting for an RCU grace period.
> > 
> > I think that someone who does not already have a deep understanding
> > about dma-buf and fences will have much trouble understanding *why*
> > this patch is in the log and *what it achieves*.
> > 
> > Good commit messages are at least as important as good code. In
> > drm/sched for example I've been trying so many times to figure out why
> > certain hacks and changes were implemented, but all that git-blame ever
> > gave me was one liners, often hinting at some driver internal work
> > around ._.
> 
> How about something like this:
> 
> The fence ops of a dma_fence currently need to life as long as the dma_fence 
> is alive.
> 
> This means that the module who originally issued a dma_fence can't unload 
> unless all of them are freed up.

s/who/which
s/of them/fences

> 
> As first step to solve this issue protect the fence ops by RCU.
> 
> While it is counter intuitive to protect a constant function pointer table by 
> RCU it allows modules to wait for an RCU grace period to make sure that 
> nobody is executing their functions any more.

I'd say "… allows modules to wait for an RCU grace period before they
unload, to make sure that …"

As for the commit's purpose, see bottom of my reply

> 
> 
> > 
> > > 
> > > v2: make one the now duplicated lockdep warnings a comment instead.
> > > v3: Add more documentation to ->wait and ->release callback.
> > > v4: fix typo in documentation
> > > v5: rebased on drm-tip
> > > 
> > > Signed-off-by: Christian König <[email protected]>
> > > ---
> > >  drivers/dma-buf/dma-fence.c | 69 +++++++++++++++++++++++++------------
> > >  include/linux/dma-fence.h   | 29 ++++++++++++++--
> > >  2 files changed, 73 insertions(+), 25 deletions(-)
> > > 
> > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> > > index e05beae6e407..de9bf18be3d4 100644
> > > --- a/drivers/dma-buf/dma-fence.c
> > > +++ b/drivers/dma-buf/dma-fence.c
> > > @@ -522,6 +522,7 @@ EXPORT_SYMBOL(dma_fence_signal);
> > >  signed long
> > >  dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long 
> > > timeout)
> > >  {
> > > + const struct dma_fence_ops *ops;
> > >   signed long ret;
> > >  
> > >   if (WARN_ON(timeout < 0))
> > > @@ -533,15 +534,21 @@ dma_fence_wait_timeout(struct dma_fence *fence, 
> > > bool intr, signed long timeout)
> > >  
> > >   dma_fence_enable_sw_signaling(fence);
> > >  
> > > - if (trace_dma_fence_wait_start_enabled()) {
> > 
> > Why can wait_start_enabled() be removed? Is that related to the life
> > time decoupling or is it a separate topic?
> 
> It isn't removed, I've just removed the "if 
> (trace_dma_fence_wait_start_enabled())" optimization which is used by the 
> tracing subsystem as self-patching code (longer story).
> 
> The trace_dma_fence_wait_start() trace point function is still called a few 
> lines below.

OK.

> 
> > > -         rcu_read_lock();
> > > -         trace_dma_fence_wait_start(fence);
> > > + rcu_read_lock();
> > > + ops = rcu_dereference(fence->ops);
> > > + trace_dma_fence_wait_start(fence);
> > > + if (ops->wait) {
> > > +         /*
> > > +          * Implementing the wait ops is deprecated and not supported for
> > > +          * issuer independent fences, so it is ok to use the ops outside
> > 
> > s/issuer/issuers of
> 
> Fixed.
> 
> > And how do we know that this here is an independent fence?
> > What even is an "independent fence" – one with internal spinlock?
> 
> I rephrased the sentence a bit to make that more clearer:
> 
>                 /*
>                  * Implementing the wait ops is deprecated and not supported 
> for
>                  * issuers of fences who wants them to be independent of their

s/wants/need their lifetime to be

>                  * module after they signal, so it is ok to use the ops 
> outside
>                  * the RCU protected section.
>                  */
> 
> 
> > 
> > > +          * the RCU protected section.
> > > +          */
> > > +         rcu_read_unlock();
> > > +         ret = ops->wait(fence, intr, timeout);
> > > + } else {
> > >           rcu_read_unlock();
> > > - }
> > > - if (fence->ops->wait)
> > > -         ret = fence->ops->wait(fence, intr, timeout);
> > > - else
> > >           ret = dma_fence_default_wait(fence, intr, timeout);
> > > + }
> > 
> > The git diff here looks awkward. Do you use git format-patch --
> > histogram?
> 
> Nope, what's the matter?

The '}' is removed and then added again.

> 
> > >   if (trace_dma_fence_wait_end_enabled()) {
> > >           rcu_read_lock();
> > >           trace_dma_fence_wait_end(fence);
> 
> > >  
> > > @@ -1049,7 +1067,12 @@ __dma_fence_init(struct dma_fence *fence, const 
> > > struct dma_fence_ops *ops,
> > >   BUG_ON(!ops || !ops->get_driver_name || !ops->get_timeline_name);
> > >  
> > >   kref_init(&fence->refcount);
> > > - fence->ops = ops;
> > > + /*
> > > +  * At first glance it is counter intuitive to protect a constant
> > > +  * function pointer table by RCU, but this allows modules providing the
> > > +  * function table to unload by waiting for an RCU grace period.
> > 
> > Maybe add a sentence like "Fences can live longer than the module which
> > issued them."
> 
> Going to use the same as the commit message here as soon as we synced up on 
> that.

Jawohl.

> 
> > 
> > > +  */
> > > + RCU_INIT_POINTER(fence->ops, ops);
> > >   INIT_LIST_HEAD(&fence->cb_list);
> > >   fence->lock = lock;
> > >   fence->context = context;
> > > @@ -1129,11 +1152,12 @@ EXPORT_SYMBOL(dma_fence_init64);
> > >   */
> > >  const char __rcu *dma_fence_driver_name(struct dma_fence *fence)
> > >  {
> > > - RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
> > > -                  "RCU protection is required for safe access to 
> > > returned string");
> > > + const struct dma_fence_ops *ops;
> > >  
> > > + /* RCU protection is required for safe access to returned string */
> > > + ops = rcu_dereference(fence->ops);
> > >   if (!dma_fence_test_signaled_flag(fence))
> > > -         return (const char __rcu *)fence->ops->get_driver_name(fence);
> > > +         return (const char __rcu *)ops->get_driver_name(fence);
> > >   else
> > >           return (const char __rcu *)"detached-driver";
> > >  }
> > > @@ -1161,11 +1185,12 @@ EXPORT_SYMBOL(dma_fence_driver_name);
> > >   */
> > >  const char __rcu *dma_fence_timeline_name(struct dma_fence *fence)
> > >  {
> > > - RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
> > > -                  "RCU protection is required for safe access to 
> > > returned string");
> > > + const struct dma_fence_ops *ops;
> > >  
> > > + /* RCU protection is required for safe access to returned string */
> > > + ops = rcu_dereference(fence->ops);
> > >   if (!dma_fence_test_signaled_flag(fence))
> > > -         return (const char __rcu *)fence->ops->get_driver_name(fence);
> > > +         return (const char __rcu *)ops->get_driver_name(fence);
> > >   else
> > >           return (const char __rcu *)"signaled-timeline";
> > >  }
> > 
> > Did we make any progress in our conversation about removing those two
> > functions and callbacks? They're only used by i915.
> 
> Actually they are mostly used by the trace points and debugfs, so we 
> certainly can't remove them.

._.

> 
> But I'm really wondering why the heck i915 is using them?

I just got confused because I couldn't find the place anymore. Since
they have removed it since then.

In older kernels it was used for driver logging:

https://elixir.bootlin.com/linux/v6.15.11/source/drivers/gpu/drm/i915/i915_sw_fence.c#L437


> 
> > > 
> > > @@ -454,13 +465,19 @@ dma_fence_test_signaled_flag(struct dma_fence 
> > > *fence)
> > >  static inline bool
> > >  dma_fence_is_signaled_locked(struct dma_fence *fence)
> > >  {
> > > + const struct dma_fence_ops *ops;
> > > +
> > >   if (dma_fence_test_signaled_flag(fence))
> > >           return true;
> > >  
> > > - if (fence->ops->signaled && fence->ops->signaled(fence)) {
> > > + rcu_read_lock();
> > > + ops = rcu_dereference(fence->ops);
> > > + if (ops->signaled && ops->signaled(fence)) {
> > 
> > Maybe you can educate me a bit about RCU here – couldn't this still
> > race? If the ops were unloaded before you take rcu_read_lock(),
> > rcu_dereference() would give you an invalid pointer here since you
> > don't check for !ops, no?
> 
> Perfectly correct thinking, yes.
> 
> But the check for !ops is added in patch #2 when we actually start to set ops 
> = NULL when the fence signals.
> 
> I intentionally separated that because it is basically the second step in 
> making the solution to detach the fence ops from the module by RCU work.
> 
> We could merge the two patches together, but I think the separation actually 
> makes sense should anybody start to complain about the additional RCU 
> overhead.
> 

Alright, makes sense. However the above does not read correct..

But then my question would be: What's the purpose of this patch, what
does it solve or address atomically?

Adding RCU here does not yet change behavior and it does not solve the
unloading problem, does it?


If it's a mere preperational step and the patches should not be merged,
I'd guard the above with a simple comment like "Cleanup preparation.
'ops' can yet not be NULL, but this will be the case subsequently."


P.

Reply via email to