On Tue, 2026-06-09 at 15:34 +0200, Christian König wrote:
> On 6/9/26 15:19, Philipp Stanner wrote:
> > And look what I found:
> > 
> > static inline bool
> > nouveau_cli_work_ready(struct dma_fence *fence)
> > {
> >  unsigned long flags;
> >  bool ret = true;
> > 
> >  dma_fence_lock_irqsave(fence, flags);
> >  if (!dma_fence_is_signaled_locked(fence))
> >  ret = false;
> >  dma_fence_unlock_irqrestore(fence, flags);
> > 
> >  if (ret == true)
> >  dma_fence_put(fence);
> >  return ret;
> > }
> > 
> > 
> > That looks weird, doesn't it?
> 
> No, that is pretty much expected.
> 
> This issue results because of the lock inversion/cleanup race between
> nouveau_fence_chan->lock and dropping the last reference.
> 
> That in turn is caused by the fact that enable_signaling is called
> with the fence lock held 


Are you referring to this comment from the documentation?

 * Since many implementations can call dma_fence_signal() even when before
 * @enable_signaling has been called there's a race window, where the
 * dma_fence_signal() might result in the final fence reference being
 * released and its memory freed. To avoid this, implementations of this
 * callback should grab their own reference using dma_fence_get(), to be
 * released when the fence is signalled (through e.g. the interrupt
 * handler).
 *
 * This callback is optional. If this callback is not present, then the
 * driver must always have signaling enabled.
 */
 bool (*enable_signaling)(struct dma_fence *fence);

> and delegates the signaling to the caller instead of doing it itself.

Who delegates what to whom?

enable_signaling() is called indirectly by someone who adds a callback.
If enable_signaling() is implemented, then the driver callback's only
job is to activate some sort of interrupt or worker that will signal
that fence at some point.


> This in turn means that you can't do proper cleanup after the
> signaling is done by grabbing driver specific locks.

Sure you can. If everything is properly synchronized.

// driver
dma_fence_signal(f);
// all callbacks can't reach our driver anymore
struct driver_fence = container_of(f);
lock(driver_fence->special_lock);
cleanup(f);
unlock(…);

Where is the deadlock?

I continue to fail to see it. Show me an example of some code that
would deadlock, please. Either fictive or from the kernel.

Nouveau's enable_signaling() callback does not take locks. Its
signaled() does not take locks.

I went through all implementors of signaled() and found that only xe
and nouveau invoke some function pointer that we want to investigate
closer:

User Way signaled() operates
-----------------------------------------------
amdgpu_userq_fence.c lockless
etnaviv lockless
i915 lockless
nouveau lockless?
radeon lockless
vc4 lockless
xe lockless?


Who will be deadlocking and with which locks?


> This is *exactly* the problem I'm trying to prevent here.
> 
> When the callbacks wouldn't be called with the fence lock held the
> Nouveau nouveau_fence_chan->lock and the fence lock would be
> completely independent.

You seem to circle around the idea that Nouveau's fence list is
protected with the shared lock?

That issue, if it exists, is solvable through the embedded lock.

Anyways, there is no issue. Neither Nouveau nor other drivers will
deadlock if we call ops->signaled() with the fence lock around.

> 
> This results in much better cleanup paths, fences which are
> independent of their issuers and in general much simpler handling for
> all dma_fence implementation backends because we don't need to worry
> all the time about lock inversions between the fence lock and
> internal driver locks.
> 
> So as far as I can see what you suggest here is exactly what has
> caused all the problems in the first place.
> 
> For the cleanup path in Rust you should be trivially able to use
> call_rcu() if the synchronized cleanup path would be causing issues
> (which I clearly agree on).

You can *satisfy the current dma_fence API* with call_rcu(), but as
Danilo states delaying work always has subtleties – plus I don't really
want to have an unnecessary call_rcu() call for tens of thousands of
fences per second.

And I would prefer if the dma_fence implementation just wouldn't be
racing. It is clearly desirable that a call to dma_fence_signal()
decouples the callbacks. That's even how you have designed it, just
with an unnecessary graceperiod thereafter.

I also doubt that it's robust that dma_fence_is_signaled() can return
true while the callbacks are not all executed.

And all the RCU dancing is fragile, where people need to guard API
calls with the RCU read lock instead of having the ability to rely on
their reference / refcount. It would be far safer if dma_fence would
guard against everything with the lock.

Summarizing:
 * enable_signaling() uses the lock already.
 * set_deadline() users use the lock around the entire callback.
 * dma_fence_is_signaled_locked() demands that it must always be
   possible to invoke ops->signaled() with a lock, as Danilo pointed
   out.
 * The few implementors of ops->signaled(), see my list above, all do
   atomic operations without any locks at all.
 * The other callbacks are either deprecated or not relevant in this
   regard.

So making dma_fence locking consistent would make all our problems
disappear, could probably even solve the unload-problem if
dma_fence_signal() becomes a hard, synchronous decoupling point, and it
comes at no proven cost.



P.

Reply via email to