On Thu, Mar 19, 2026 at 10:57:29AM +0100, Boris Brezillon wrote: > On Wed, 18 Mar 2026 15:40:35 -0700 > Matthew Brost <[email protected]> wrote: > > > > > > > > > So I don’t think Rust natively solves these types of problems, although > > > > I’ll concede that it does make refcounting a bit more sane. > > > > > > Rust won't magically defer the cleanup, nor will it dictate how you want > > > to do the queue teardown, those are things you need to implement. But it > > > should give visibility about object lifetimes, and guarantee that an > > > object that's still visible to some owners is usable (the notion of > > > usable is highly dependent on the object implementation). > > > > > > Just a purely theoretical example of a multi-step queue teardown that > > > might be possible to encode in rust: > > > > > > - MyJobQueue<Usable>: The job queue is currently exposed and usable. > > > There's a ::destroy() method consuming 'self' and returning a > > > MyJobQueue<Destroyed> object > > > - MyJobQueue<Destroyed>: The user asked for the workqueue to be > > > destroyed. No new job can be pushed. Existing jobs that didn't make > > > it to the FW queue are cancelled, jobs that are in-flight are > > > cancelled if they can, or are just waited upon if they can't. When > > > the whole destruction step is done, ::destroyed() is called, it > > > consumes 'self' and returns a MyJobQueue<Inactive> object. > > > - MyJobQueue<Inactive>: The queue is no longer active (HW doesn't have > > > any resources on this queue). It's ready to be cleaned up. > > > ::cleanup() (or just ::drop()) defers the cleanup of some inner > > > object that has been passed around between the various > > > MyJobQueue<State> wrappers. > > > > > > Each of the state transition can happen asynchronously. A state > > > transition consumes the object in one state, and returns a new object > > > in its new state. None of the transition involves dropping a refcnt, > > > ownership is just transferred. The final MyJobQueue<Inactive> object is > > > the object we'll defer cleanup on. > > > > > > It's a very high-level view of one way this can be implemented (I'm > > > sure there are others, probably better than my suggestion) in order to > > > make sure the object doesn't go away without the compiler enforcing > > > proper state transitions. > > > > > > > I'm sure Rust can implement this. My point about Rust is it doesn't > > magically solve hard software arch probles, but I will admit the > > ownership model, way it can enforce locking at compile time is pretty > > cool. > > It's not quite about rust directly solving those problems for you, it's > about rust forcing you to think about those problems in the first > place. So no, rust won't magically solve your multi-step teardown with > crazy CPU <-> Device synchronization etc, but it allows you to clearly > identify those steps, and think about how you want to represent them > without abusing other concepts, like object refcounting/ownership. > Everything I described, you can code it in C BTW, it's just that C is so > lax that you can also abuse other stuff to get to your ends, which might > or might not be safe, but more importantly, will very likely obfuscate > the code (even with good docs). >
This is very well put, and I completely agree. Sorry—I get annoyed by the Rust comments. It solves some classes of problems, but it doesn’t magically solve complex software architecture issues that need to be thoughtfully designed. > > > > > > > > > +/** > > > > > > > + * DOC: DRM dependency fence > > > > > > > + * > > > > > > > + * Each struct drm_dep_job has an associated struct > > > > > > > drm_dep_fence that > > > > > > > + * provides a single dma_fence (@finished) signalled when the > > > > > > > hardware > > > > > > > + * completes the job. > > > > > > > + * > > > > > > > + * The hardware fence returned by &drm_dep_queue_ops.run_job is > > > > > > > stored as > > > > > > > + * @parent. @finished is chained to @parent via > > > > > > > drm_dep_job_done_cb() and > > > > > > > + * is signalled once @parent signals (or immediately if > > > > > > > run_job() returns > > > > > > > + * NULL or an error). > > > > > > > > > > > > I thought this fence proxy mechanism was going away due to recent > > > > > > work being > > > > > > carried out by Christian? > > > > > > > > > > > > > > Consider the case where a driver’s hardware fence is implemented as a > > > > dma-fence-array or dma-fence-chain. You cannot install these types of > > > > fences into a dma-resv or into syncobjs, so a proxy fence is useful > > > > here. > > > > > > Hm, so that's a driver returning a dma_fence_array/chain through > > > ::run_job()? Why would we not want to have them directly exposed and > > > split up into singular fence objects at resv insertion time (I don't > > > think syncobjs care, but I might be wrong). I mean, one of the point > > > > You can stick dma-fence-arrays in syncobjs, but not chains. > > Yeah, kinda makes sense, since timeline syncobjs use chains, and if the > chain reject inner chains, it won't work. > +1, Exactly. > > > > Neither dma-fence-arrays/chain can go into dma-resv. > > They can't go directly in it, but those can be split into individual > fences and be inserted, which would achieve the same goal. > Yes, but now it becomes a driver problem (maybe only mine) rather than an opaque job fence that can be inserted. In my opinion, it’s best to keep the job vs. hardware fence abstraction. > > > > Hence why disconnecting a job's finished fence from hardware fence IMO > > is good idea to keep so gives drivers flexiblity on the hardware fences. > > The thing is, I'm not sure drivers were ever meant to expose containers > through ::run_job(). > Well there haven't been any rules... > > e.g., If this design didn't have a job's finished fence, I'd have to > > open code one Xe side. > > There might be other reasons we'd like to keep the > drm_sched_fence-like proxy that I'm missing. But if it's the only one, > and the fence-combining pattern you're describing is common to multiple > drivers, we can provide a container implementation that's not a > fence_array, so you can use it to insert driver fences into other > containers. This way we wouldn't force the proxy model to all drivers, > but we would keep the code generic/re-usable. > > > > > > behind the container extraction is so fences coming from the same > > > context/timeline can be detected and merged. If you insert the > > > container through a proxy, you're defeating the whole fence merging > > > optimization. > > > > Right. Finished fences have single timeline too... > > Aren't you faking a single timeline though if you combine fences from > different engines running at their own pace into a container? > > > > > > > > > The second thing is that I'm not sure drivers were ever supposed to > > > return fence containers in the first place, because the whole idea > > > behind a fence context is that fences are emitted/signalled in > > > seqno-order, and if the fence is encoding the state of multiple > > > timelines that progress at their own pace, it becomes tricky to control > > > that. I guess if it's always the same set of timelines that are > > > combined, that would work. > > > > Xe does this is definitely works. We submit to multiple rings, when all > > rings signal a seqno, a chain or array signals -> finished fence > > signals. The queues used in this manor can only submit multiple ring > > jobs so the finished fence timeline stays intact. If you could a > > multiple rings followed by a single ring submission on the same queue, > > yes this could break. > > Okay, I had the same understanding, thanks for confirming. > I think the last three comments are resolved here—it’s a queue timeline. As long as the queue has consistent rules (i.e., submits to a consistent set of rings), this whole approach makes sense? > > > > > > > > > One example is when a single job submits work to multiple rings > > > > that are flipped in hardware at the same time. > > > > > > We do have that in Panthor, but that's all explicit: in a single > > > SUBMIT, you can have multiple jobs targeting different queues, each of > > > them having their own set of deps/signal ops. The combination of all the > > > signal ops into a container is left to the UMD. It could be automated > > > kernel side, but that would be a flag on the SIGNAL op leading to the > > > creation of a fence_array containing fences from multiple submitted > > > jobs, rather than the driver combining stuff in the fence it returns in > > > ::run_job(). > > > > See above. We have a dedicated queue type for these type of submissions > > and single job that submits to the all rings. We had multiple queue / > > jobs in the i915 to implemented this but it turns out it is much cleaner > > with a single queue / singler job / multiple rings model. > > Hm, okay. It didn't turn into a mess in Panthor, but Xe is likely an > order of magnitude more complicated that Mali, so I'll refrain from > judging this design decision. > Yes, Xe is a beast, but we tend to build complexity into components and layers to manage it. That is what I’m attempting to do here. > > > > > > > > > > > > > Another case is late arming of hardware fences in run_job (which many > > > > drivers do). The proxy fence is immediately available at arm time and > > > > can be installed into dma-resv or syncobjs even though the actual > > > > hardware fence is not yet available. I think most drivers could be > > > > refactored to make the hardware fence immediately available at run_job, > > > > though. > > > > > > Yep, I also think we can arm the driver fence early in the case of > > > JobQueue. The reason it couldn't be done before is because the > > > scheduler was in the middle, deciding which entity to pull the next job > > > from, which was changing the seqno a job driver-fence would be assigned > > > (you can't guess that at queue time in that case). > > > > > > > Xe doesn't need to late arming, but it look like multiple drivers to > > implement the late arming which may be required (?). > > As I said, it's mostly a problem when you have a > single-HW-queue:multiple-contexts model, which is exactly what > drm_sched was designed for. I suspect early arming is not an issue for > any of the HW supporting FW-based scheduling (PVR, Mali, NVidia, > ...). If you want to use drm_dep for all drivers currently using > drm_sched (I'm still not convinced this is a good idea to do that > just yet, because then you're going to pull a lot of the complexity > we're trying to get rid of), then you need late arming of driver fences. > Yes, even the hardware scheduling component [1] I hacked together relied on no late arming. But even then, you can arm a dma-fence early and assign a hardware seqno later in run_job()—those are two different things. [1] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svn-perf-6-15-2025/-/commit/22c8aa993b5c9e4ad0c312af2f3e032273d20966#line_7c49af3ee_A319 > > > > > [...] > > > > > > > > > > + * **Reference counting** > > > > > > > + * > > > > > > > + * Jobs and queues are both reference counted. > > > > > > > + * > > > > > > > + * A job holds a reference to its queue from drm_dep_job_init() > > > > > > > until > > > > > > > + * drm_dep_job_put() drops the job's last reference and its > > > > > > > release callback > > > > > > > + * runs. This ensures the queue remains valid for the entire > > > > > > > lifetime of any > > > > > > > + * job that was submitted to it. > > > > > > > + * > > > > > > > + * The queue holds its own reference to a job for as long as the > > > > > > > job is > > > > > > > + * internally tracked: from the moment the job is added to the > > > > > > > pending list > > > > > > > + * in drm_dep_queue_run_job() until drm_dep_job_done() kicks the > > > > > > > put_job > > > > > > > + * worker, which calls drm_dep_job_put() to release that > > > > > > > reference. > > > > > > > > > > > > Why not simply keep track that the job was completed, instead of > > > > > > relinquishing > > > > > > the reference? We can then release the reference once the job is > > > > > > cleaned up > > > > > > (by the queue, using a worker) in process context. > > > > > > > > I think that’s what I’m doing, while also allowing an opt-in path to > > > > drop the job reference when it signals (in IRQ context) > > > > > > Did you mean in !IRQ (or !atomic) context here? Feels weird to not > > > defer the cleanup when you're in an IRQ/atomic context, but defer it > > > when you're in a thread context. > > > > > > > The put of a job in this design can be from an IRQ context (opt-in) > > feature. xa_destroy blows up if it is called from an IRQ context, > > although maybe that could be workaround. > > Making it so _put() in IRQ context is safe is fine, what I'm saying is > that instead of doing a partial immediate cleanup, and the rest in a > worker, we can just defer everything: that is, have some > _deref_release() function called by kref_put() that would queue a work > item from which the actual release is done. > See below. > > > > > > so we avoid > > > > switching to a work item just to drop a ref. That seems like a > > > > significant win in terms of CPU cycles. > > > > > > Well, the cleanup path is probably not where latency matters the most. > > > > Agree. But I do think avoiding a CPU context switch (work item) for a > > very lightweight job cleanup (usually just drop refs) will save of CPU > > cycles, thus also things like power, etc... > > That's the sort of statements I'd like to be backed by actual > numbers/scenarios proving that it actually makes a difference. The I disagree. This is not a locking micro-optimization, for example. It is a software architecture choice that says “do not trigger a CPU context to free a job,” which costs thousands of cycles. This will have an effect on CPU utilization and, thus, power. > mixed model where things are partially freed immediately/partially > deferred, and sometimes even with conditionals for whether the deferral > happens or not, it just makes building a mental model of this thing a > nightmare, which in turn usually leads to subtle bugs. > See above—managing complexity in components. This works in both modes. I refactored Xe so it also works in IRQ context. If it would make you feel better, I can ask my company commits CI resources so non-IRQ mode consistently works too—it’s just a single API flag on the queue. But then maybe other companies should also commit to public CI. > > > > > It's adding scheduling overhead, sure, but given all the stuff we defer > > > already, I'm not too sure we're at saving a few cycles to get the > > > cleanup done immediately. What's important to have is a way to signal > > > fences in an atomic context, because this has an impact on latency. > > > > > > > Yes. The signaling happens first then drm_dep_job_put if IRQ opt-in. > > > > > [...] > > > > > > > > > > + /* > > > > > > > + * Drop all input dependency fences now, in process context, > > > > > > > before the > > > > > > > + * final job put. Once the job is on the pending list its last > > > > > > > reference > > > > > > > + * may be dropped from a dma_fence callback (IRQ context), where > > > > > > > calling > > > > > > > + * xa_destroy() would be unsafe. > > > > > > > + */ > > > > > > > > > > > > I assume that “pending” is the list of jobs that have been handed > > > > > > to the driver > > > > > > via ops->run_job()? > > > > > > > > > > > > Can’t this problem be solved by not doing anything inside a > > > > > > dma_fence callback > > > > > > other than scheduling the queue worker? > > > > > > > > > > > > > > Yes, this code is required to support dropping job refs directly in the > > > > dma-fence callback (an opt-in feature). Again, this seems like a > > > > significant win in terms of CPU cycles, although I haven’t collected > > > > data yet. > > > > > > If it significantly hurts the perf, I'd like to understand why, because > > > to me it looks like pure-cleanup (no signaling involved), and thus no > > > other process waiting for us to do the cleanup. The only thing that > > > might have an impact is how fast you release the resources, and given > > > it's only a partial cleanup (xa_destroy() still has to be deferred), I'd > > > like to understand which part of the immediate cleanup is causing a > > > contention (basically which kind of resources the system is starving of) > > > > > > > It was more of once we moved to a ref counted model, it is pretty > > trivial allow drm_dep_job_put when the fence is signaling. It doesn't > > really add any complexity either, thus why I added it is. > > It's not the refcount model I'm complaining about, it's the "part of it > is always freed immediately, part of it is deferred, but not always ..." > that happens in drm_dep_job_release() I'm questioning. I'd really > prefer something like: > You are completely missing the point here. Here is what I’ve reduced my job put to: 188 xe_sched_job_free_fences(job); 189 dma_fence_put(job->fence); 190 job_free(job); 191 atomic_dec(&q->job_cnt); 192 xe_pm_runtime_put(xe); These are lightweight (IRQ-safe) operations that never need to be done in a work item—so why kick one? Matt > static void drm_dep_job_release() > { > // do it all unconditionally > } > > static void drm_dep_job_defer_release() > { > queue_work(&job->cleanup_work); > } > > static void drm_dep_job_put() > { > kref_put(job, drm_dep_job_defer_release); > }
