dep: Add DRM dependency queue layer

Matthew Brost Sat, 21 Mar 2026 23:43:29 -0700

On Thu, Mar 19, 2026 at 10:57:29AM +0100, Boris Brezillon wrote:
> On Wed, 18 Mar 2026 15:40:35 -0700
> Matthew Brost <[email protected]> wrote:
> 
> > > > 
> > > > So I don’t think Rust natively solves these types of problems, although
> > > > I’ll concede that it does make refcounting a bit more sane.  
> > > 
> > > Rust won't magically defer the cleanup, nor will it dictate how you want
> > > to do the queue teardown, those are things you need to implement. But it
> > > should give visibility about object lifetimes, and guarantee that an
> > > object that's still visible to some owners is usable (the notion of
> > > usable is highly dependent on the object implementation).
> > > 
> > > Just a purely theoretical example of a multi-step queue teardown that
> > > might be possible to encode in rust:
> > > 
> > > - MyJobQueue<Usable>: The job queue is currently exposed and usable.
> > >   There's a ::destroy() method consuming 'self' and returning a
> > >   MyJobQueue<Destroyed> object
> > > - MyJobQueue<Destroyed>: The user asked for the workqueue to be
> > >   destroyed. No new job can be pushed. Existing jobs that didn't make
> > >   it to the FW queue are cancelled, jobs that are in-flight are
> > >   cancelled if they can, or are just waited upon if they can't. When
> > >   the whole destruction step is done, ::destroyed() is called, it
> > >   consumes 'self' and returns a MyJobQueue<Inactive> object.
> > > - MyJobQueue<Inactive>: The queue is no longer active (HW doesn't have
> > >   any resources on this queue). It's ready to be cleaned up.
> > >   ::cleanup() (or just ::drop()) defers the cleanup of some inner
> > >   object that has been passed around between the various
> > >   MyJobQueue<State> wrappers.
> > > 
> > > Each of the state transitions can happen asynchronously. A state
> > > transition consumes the object in one state, and returns a new object
> > > in its new state. None of the transitions involves dropping a refcnt,
> > > ownership is just transferred. The final MyJobQueue<Inactive> object is
> > > the object we'll defer cleanup on.
> > > 
> > > It's a very high-level view of one way this can be implemented (I'm
> > > sure there are others, probably better than my suggestion) in order to
> > > make sure the object doesn't go away without the compiler enforcing
> > > proper state transitions.
> > >   
> > 
> > I'm sure Rust can implement this. My point about Rust is it doesn't
> > magically solve hard software arch problems, but I will admit the
> > ownership model and the way it can enforce locking at compile time are
> > pretty cool.
> 
> It's not quite about rust directly solving those problems for you, it's
> about rust forcing you to think about those problems in the first
> place. So no, rust won't magically solve your multi-step teardown with
> crazy CPU <-> Device synchronization etc, but it allows you to clearly
> identify those steps, and think about how you want to represent them
> without abusing other concepts, like object refcounting/ownership.
> Everything I described, you can code it in C BTW, it's just that C is so
> lax that you can also abuse other stuff to get to your ends, which might
> or might not be safe, but more importantly, will very likely obfuscate
> the code (even with good docs).
>


This is very well put, and I completely agree. Sorry—I get annoyed by
the Rust comments. It solves some classes of problems, but it doesn’t
magically solve complex software architecture issues that need to be
thoughtfully designed.
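
As a rough illustration of the typestate teardown described above, here is a
minimal userspace Rust sketch. Only the names MyJobQueue, Usable, Destroyed,
Inactive, destroy(), destroyed() and cleanup() come from the description;
everything else (the Inner struct, u32 job ids, the channel standing in for a
deferred-cleanup worker) is an assumption made for the example. It is not
kernel code and not a proposed design.

```rust
use std::marker::PhantomData;
use std::sync::mpsc::Sender;

// Zero-sized marker types, one per teardown stage.
pub struct Usable;
pub struct Destroyed;
pub struct Inactive;

// Hypothetical inner object handed from one state wrapper to the next.
pub struct Inner {
    pub pending: Vec<u32>,   // jobs not yet pushed to the FW queue
    pub in_flight: Vec<u32>, // jobs the hardware may still be running
}

pub struct MyJobQueue<S> {
    inner: Inner,
    cleanup_tx: Sender<Inner>, // stand-in for a deferred-cleanup worker
    _state: PhantomData<S>,
}

impl<S> MyJobQueue<S> {
    // Ownership of the inner object moves to the next state; no refcount drops.
    fn transition<T>(self) -> MyJobQueue<T> {
        MyJobQueue { inner: self.inner, cleanup_tx: self.cleanup_tx, _state: PhantomData }
    }
    pub fn pending(&self) -> usize { self.inner.pending.len() }
    pub fn in_flight(&self) -> usize { self.inner.in_flight.len() }
}

impl MyJobQueue<Usable> {
    pub fn new(cleanup_tx: Sender<Inner>) -> Self {
        MyJobQueue {
            inner: Inner { pending: Vec::new(), in_flight: Vec::new() },
            cleanup_tx,
            _state: PhantomData,
        }
    }
    // Only a Usable queue accepts jobs; other states have no push() at all.
    pub fn push(&mut self, job: u32) { self.inner.pending.push(job); }
    // Hand the oldest pending job to the (imaginary) hardware.
    pub fn run_next(&mut self) -> Option<u32> {
        if self.inner.pending.is_empty() { return None; }
        let job = self.inner.pending.remove(0);
        self.inner.in_flight.push(job);
        Some(job)
    }
    // Cancel everything that never reached the hardware.
    pub fn destroy(mut self) -> MyJobQueue<Destroyed> {
        self.inner.pending.clear();
        self.transition()
    }
}

impl MyJobQueue<Destroyed> {
    // Stand-in for "cancel or wait for in-flight jobs"; here they just complete.
    pub fn destroyed(mut self) -> MyJobQueue<Inactive> {
        self.inner.in_flight.clear();
        self.transition()
    }
}

impl MyJobQueue<Inactive> {
    // Defer the real cleanup by handing the inner object to the worker side.
    pub fn cleanup(self) {
        let MyJobQueue { inner, cleanup_tx, .. } = self;
        let _ = cleanup_tx.send(inner);
    }
}
```

Calling push() on a MyJobQueue<Destroyed> or cleanup() on a MyJobQueue<Usable>
does not compile, which is the whole point: the teardown order is carried by
the type rather than by refcount conventions.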

> > 
> > > > > > > +/**
> > > > > > > + * DOC: DRM dependency fence
> > > > > > > + *
> > > > > > > + * Each struct drm_dep_job has an associated struct drm_dep_fence that
> > > > > > > + * provides a single dma_fence (@finished) signalled when the hardware
> > > > > > > + * completes the job.
> > > > > > > + *
> > > > > > > + * The hardware fence returned by &drm_dep_queue_ops.run_job is stored as
> > > > > > > + * @parent. @finished is chained to @parent via drm_dep_job_done_cb() and
> > > > > > > + * is signalled once @parent signals (or immediately if run_job() returns
> > > > > > > + * NULL or an error).
> > > > > > 
> > > > > > I thought this fence proxy mechanism was going away due to recent work
> > > > > > being carried out by Christian?
> > > > > >     
> > > > 
> > > > Consider the case where a driver’s hardware fence is implemented as a
> > > > dma-fence-array or dma-fence-chain. You cannot install these types of
> > > > fences into a dma-resv or into syncobjs, so a proxy fence is useful
> > > > here.  
> > > 
> > > Hm, so that's a driver returning a dma_fence_array/chain through
> > > ::run_job()? Why would we not want to have them directly exposed and
> > > split up into singular fence objects at resv insertion time (I don't
> > > think syncobjs care, but I might be wrong). I mean, one of the points
> > 
> > You can stick dma-fence-arrays in syncobjs, but not chains.
> 
> Yeah, kinda makes sense, since timeline syncobjs use chains, and if the
> chain rejects inner chains, it won't work.
> 

+1, Exactly.

> > 
> > Neither dma-fence-arrays/chain can go into dma-resv.
> 
> They can't go directly in it, but those can be split into individual
> fences and be inserted, which would achieve the same goal.
> 

Yes, but now it becomes a driver problem (maybe only mine) rather than
an opaque job fence that can be inserted. In my opinion, it’s best to
keep the job vs. hardware fence abstraction.

> > 
> > Hence why disconnecting a job's finished fence from the hardware fence is,
> > IMO, a good idea to keep: it gives drivers flexibility on the hardware fences.
> 
> The thing is, I'm not sure drivers were ever meant to expose containers
> through ::run_job().
> 

Well there haven't been any rules...

> > e.g., If this design didn't have a job's finished fence, I'd have to
> > open code one Xe side.
> 
> There might be other reasons we'd like to keep the
> drm_sched_fence-like proxy that I'm missing. But if it's the only one,
> and the fence-combining pattern you're describing is common to multiple
> drivers, we can provide a container implementation that's not a
> fence_array, so you can use it to insert driver fences into other
> containers. This way we wouldn't force the proxy model on all drivers,
> but we would keep the code generic/re-usable.
> 
> > 
> > > behind the container extraction is so fences coming from the same
> > > context/timeline can be detected and merged. If you insert the
> > > container through a proxy, you're defeating the whole fence merging
> > > optimization.  
> > 
> > Right. Finished fences have single timeline too...
> 
> Aren't you faking a single timeline though if you combine fences from
> different engines running at their own pace into a container?
> 
> > 
> > > 
> > > The second thing is that I'm not sure drivers were ever supposed to
> > > return fence containers in the first place, because the whole idea
> > > behind a fence context is that fences are emitted/signalled in
> > > seqno-order, and if the fence is encoding the state of multiple
> > > timelines that progress at their own pace, it becomes tricky to control
> > > that. I guess if it's always the same set of timelines that are
> > > combined, that would work.  
> > 
> > Xe does this and it definitely works. We submit to multiple rings; when all
> > rings signal a seqno, a chain or array signals -> finished fence
> > signals. The queues used in this manner can only submit multiple-ring
> > jobs, so the finished fence timeline stays intact. If you could do a
> > multiple-ring submission followed by a single-ring submission on the same
> > queue, yes, this could break.
> 
> Okay, I had the same understanding, thanks for confirming.
> 

I think the last three comments are resolved here—it’s a queue timeline.
As long as the queue has consistent rules (i.e., submits to a consistent
set of rings), this whole approach makes sense?
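
To make the "consistent set of rings" rule concrete, here is a toy userspace
Rust model. MultiRingQueue and its methods are hypothetical names invented for
the example; this is not Xe or drm_dep code. It assumes every job on the queue
is submitted to all rings with the same seqno, so a job's finished fence has
signalled once every ring has reached that seqno.

```rust
/// Toy model of a queue whose jobs always submit to the same set of rings.
pub struct MultiRingQueue {
    /// Last seqno each ring has signalled (0 means nothing yet).
    ring_seqno: Vec<u64>,
}

impl MultiRingQueue {
    pub fn new(rings: usize) -> Self {
        MultiRingQueue { ring_seqno: vec![0; rings] }
    }

    /// A ring reports progress; rings only ever move forward.
    pub fn ring_signal(&mut self, ring: usize, seqno: u64) {
        if seqno > self.ring_seqno[ring] {
            self.ring_seqno[ring] = seqno;
        }
    }

    /// Highest job seqno whose finished fence has signalled: the slowest ring
    /// bounds the whole queue, so this is the minimum across rings.
    pub fn finished_seqno(&self) -> u64 {
        self.ring_seqno.iter().copied().min().unwrap_or(0)
    }

    /// Has the finished fence for this job signalled?
    pub fn is_finished(&self, job_seqno: u64) -> bool {
        job_seqno <= self.finished_seqno()
    }
}
```

Because finished_seqno() is a minimum over values that only grow, it only
grows too, so finished fences signal in seqno order. Mixing in a job that
targets only a subset of the rings would break that assumption, which is the
failure case mentioned above.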

> > 
> > >   
> > > > One example is when a single job submits work to multiple rings
> > > > that are flipped in hardware at the same time.  
> > > 
> > > We do have that in Panthor, but that's all explicit: in a single
> > > SUBMIT, you can have multiple jobs targeting different queues, each of
> > > them having their own set of deps/signal ops. The combination of all the
> > > signal ops into a container is left to the UMD. It could be automated
> > > kernel side, but that would be a flag on the SIGNAL op leading to the
> > > creation of a fence_array containing fences from multiple submitted
> > > jobs, rather than the driver combining stuff in the fence it returns in
> > > ::run_job().  
> > 
> > See above. We have a dedicated queue type for these types of submissions
> > and a single job that submits to all the rings. We had multiple queues /
> > jobs in i915 to implement this, but it turns out it is much cleaner
> > with a single queue / single job / multiple rings model.
> 
> Hm, okay. It didn't turn into a mess in Panthor, but Xe is likely an
> order of magnitude more complicated than Mali, so I'll refrain from
> judging this design decision.
> 

Yes, Xe is a beast, but we tend to build complexity into components and
layers to manage it. That is what I’m attempting to do here.

> > 
> > >   
> > > > 
> > > > Another case is late arming of hardware fences in run_job (which many
> > > > drivers do). The proxy fence is immediately available at arm time and
> > > > can be installed into dma-resv or syncobjs even though the actual
> > > > hardware fence is not yet available. I think most drivers could be
> > > > refactored to make the hardware fence immediately available at run_job,
> > > > though.  
> > > 
> > > Yep, I also think we can arm the driver fence early in the case of
> > > JobQueue. The reason it couldn't be done before is because the
> > > scheduler was in the middle, deciding which entity to pull the next job
> > > from, which was changing the seqno a job driver-fence would be assigned
> > > (you can't guess that at queue time in that case).
> > >   
> > 
> > Xe doesn't need late arming, but it looks like multiple drivers
> > implement late arming, which may be required (?).
> 
> As I said, it's mostly a problem when you have a
> single-HW-queue:multiple-contexts model, which is exactly what
> drm_sched was designed for. I suspect early arming is not an issue for
> any of the HW supporting FW-based scheduling (PVR, Mali, NVidia,
> ...). If you want to use drm_dep for all drivers currently using
> drm_sched (I'm still not convinced this is a good idea to do that
> just yet, because then you're going to pull a lot of the complexity
> we're trying to get rid of), then you need late arming of driver fences.
> 

Yes, even the hardware scheduling component [1] I hacked together relied
on no late arming. But even then, you can arm a dma-fence early and
assign a hardware seqno later in run_job()—those are two different
things.

[1] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svn-perf-6-15-2025/-/commit/22c8aa993b5c9e4ad0c312af2f3e032273d20966#line_7c49af3ee_A319

> > 
> > > [...]
> > >   
> > > > > > > + * **Reference counting**
> > > > > > > + *
> > > > > > > + * Jobs and queues are both reference counted.
> > > > > > > + *
> > > > > > > + * A job holds a reference to its queue from drm_dep_job_init() until
> > > > > > > + * drm_dep_job_put() drops the job's last reference and its release callback
> > > > > > > + * runs. This ensures the queue remains valid for the entire lifetime of any
> > > > > > > + * job that was submitted to it.
> > > > > > > + *
> > > > > > > + * The queue holds its own reference to a job for as long as the job is
> > > > > > > + * internally tracked: from the moment the job is added to the pending list
> > > > > > > + * in drm_dep_queue_run_job() until drm_dep_job_done() kicks the put_job
> > > > > > > + * worker, which calls drm_dep_job_put() to release that reference.
> > > > > > 
> > > > > > Why not simply keep track that the job was completed, instead of
> > > > > > relinquishing the reference? We can then release the reference once the
> > > > > > job is cleaned up (by the queue, using a worker) in process context.
> > > > 
> > > > I think that’s what I’m doing, while also allowing an opt-in path to
> > > > drop the job reference when it signals (in IRQ context)  
> > > 
> > > Did you mean in !IRQ (or !atomic) context here? Feels weird to not
> > > defer the cleanup when you're in an IRQ/atomic context, but defer it
> > > when you're in a thread context.
> > >   
> > 
> > The put of a job in this design can be from an IRQ context (an opt-in
> > feature). xa_destroy blows up if it is called from an IRQ context,
> > although maybe that could be worked around.
> 
> Making it so _put() in IRQ context is safe is fine, what I'm saying is
> that instead of doing a partial immediate cleanup, and the rest in a
> worker, we can just defer everything: that is, have some
> _deref_release() function called by kref_put() that would queue a work
> item from which the actual release is done.
> 

See below.

> > 
> > > > so we avoid
> > > > switching to a work item just to drop a ref. That seems like a
> > > > significant win in terms of CPU cycles.  
> > > 
> > > Well, the cleanup path is probably not where latency matters the most.  
> > 
> > Agree. But I do think avoiding a CPU context switch (work item) for a
> > very lightweight job cleanup (usually just drop refs) will save CPU
> > cycles, thus also things like power, etc...
> 
> That's the sort of statements I'd like to be backed by actual
> numbers/scenarios proving that it actually makes a difference. The

I disagree. This is not a locking micro-optimization, for example. It is
a software architecture choice that says “do not trigger a CPU context
switch to free a job,” which costs thousands of cycles. This will have an
effect on CPU utilization and, thus, power.

> mixed model where things are partially freed immediately/partially
> deferred, and sometimes even with conditionals for whether the deferral
> happens or not, it just makes building a mental model of this thing a
> nightmare, which in turn usually leads to subtle bugs.
> 

See above—managing complexity in components. This works in both modes. I
refactored Xe so it also works in IRQ context. If it would make you feel
better, I can ask my company to commit CI resources so non-IRQ mode
consistently works too; it’s just a single API flag on the queue. But
then maybe other companies should also commit to public CI.

> > 
> > > It's adding scheduling overhead, sure, but given all the stuff we defer
> > > already, I'm not too sure we're at saving a few cycles to get the
> > > cleanup done immediately. What's important to have is a way to signal
> > > fences in an atomic context, because this has an impact on latency.
> > >   
> > 
> > Yes. The signaling happens first then drm_dep_job_put if IRQ opt-in.
> > 
> > > [...]
> > >   
> > > > > > > + /*
> > > > > > > + * Drop all input dependency fences now, in process context, before the
> > > > > > > + * final job put. Once the job is on the pending list its last reference
> > > > > > > + * may be dropped from a dma_fence callback (IRQ context), where calling
> > > > > > > + * xa_destroy() would be unsafe.
> > > > > > > + */
> > > > > > 
> > > > > > I assume that “pending” is the list of jobs that have been handed to the
> > > > > > driver via ops->run_job()?
> > > > > > 
> > > > > > Can’t this problem be solved by not doing anything inside a dma_fence
> > > > > > callback other than scheduling the queue worker?
> > > > > >     
> > > > 
> > > > Yes, this code is required to support dropping job refs directly in the
> > > > dma-fence callback (an opt-in feature). Again, this seems like a
> > > > significant win in terms of CPU cycles, although I haven’t collected
> > > > data yet.  
> > > 
> > > If it significantly hurts the perf, I'd like to understand why, because
> > > to me it looks like pure-cleanup (no signaling involved), and thus no
> > > other process waiting for us to do the cleanup. The only thing that
> > > might have an impact is how fast you release the resources, and given
> > > it's only a partial cleanup (xa_destroy() still has to be deferred), I'd
> > > like to understand which part of the immediate cleanup is causing a
> > > contention (basically which kind of resources the system is starving of)
> > >   
> > 
> > It was more that once we moved to a ref-counted model, it is pretty
> > trivial to allow drm_dep_job_put when the fence is signaling. It doesn't
> > really add any complexity either, which is why I added it.
> 
> It's not the refcount model I'm complaining about, it's the "part of it
> is always freed immediately, part of it is deferred, but not always ..."
> that happens in drm_dep_job_release() I'm questioning. I'd really
> prefer something like:
> 

You are completely missing the point here.

Here is what I’ve reduced my job put to:

        xe_sched_job_free_fences(job);
        dma_fence_put(job->fence);
        job_free(job);
        atomic_dec(&q->job_cnt);
        xe_pm_runtime_put(xe);

These are lightweight (IRQ-safe) operations that never need to be done
in a work item—so why kick one?
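
For anyone following along, the two put policies being argued about can be
modelled in a few lines of userspace Rust. This is only a sketch under my own
assumptions: Job, PutPolicy, Outcome and job_put() are hypothetical names, the
channel stands in for a work item, and the actual release work is elided. It
is not the drm_dep implementation.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::Sender;

/// Hypothetical refcounted job; starts with one reference.
pub struct Job {
    pub id: u32,
    refs: AtomicUsize,
}

impl Job {
    pub fn new(id: u32) -> Self {
        Job { id, refs: AtomicUsize::new(1) }
    }
    pub fn get(&self) {
        self.refs.fetch_add(1, Ordering::Relaxed);
    }
}

/// How the final put is handled.
pub enum PutPolicy {
    /// Release right where the last reference is dropped (the opt-in mode).
    Inline,
    /// Always hand the job id to a worker and release there.
    Deferred(Sender<u32>),
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    StillReferenced,
    ReleasedInline,
    QueuedForWorker,
}

pub fn job_put(job: &Job, policy: &PutPolicy) -> Outcome {
    // fetch_sub returns the previous count; 1 means this was the last reference.
    if job.refs.fetch_sub(1, Ordering::AcqRel) != 1 {
        return Outcome::StillReferenced;
    }
    match policy {
        // The lightweight release work would run here, in the caller's context.
        PutPolicy::Inline => Outcome::ReleasedInline,
        // Costs a wakeup of the worker, but release always runs in one context.
        PutPolicy::Deferred(tx) => {
            tx.send(job.id).expect("cleanup worker is gone");
            Outcome::QueuedForWorker
        }
    }
}
```

The inline arm only makes sense when everything in the release path is safe in
the context of the final put, which is the property the five calls above are
meant to have.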

Matt

> static void drm_dep_job_release()
> {
>       // do it all unconditionally
> }
> 
> static void drm_dep_job_defer_release()
> {
>       queue_work(&job->cleanup_work);
> }
> 
> static void drm_dep_job_put()
> {
>       kref_put(job, drm_dep_job_defer_release);
> }

Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer
