On Mon, Mar 16, 2026 at 10:16:01AM +0100, Boris Brezillon wrote:
> Hi Matthew,
>
> On Sun, 15 Mar 2026 21:32:45 -0700
> Matthew Brost <[email protected]> wrote:
>
> > Diverging requirements between GPU drivers using firmware scheduling
> > and those using hardware scheduling have shown that drm_gpu_scheduler is
> > no longer sufficient for firmware-scheduled GPU drivers. The technical
> > debt, lack of memory-safety guarantees, absence of clear object-lifetime
> > rules, and numerous driver-specific hacks have rendered
> > drm_gpu_scheduler unmaintainable. It is time for a fresh design for
> > firmware-scheduled GPU drivers, one that addresses all of the
> > aforementioned shortcomings.
> >
> > Add drm_dep, a lightweight GPU submission queue intended as a
> > replacement for drm_gpu_scheduler for firmware-managed GPU schedulers
> > (e.g. Xe, Panthor, AMDXDNA, PVR, Nouveau, Nova). Unlike
> > drm_gpu_scheduler, which separates the scheduler (drm_gpu_scheduler)
> > from the queue (drm_sched_entity) into two objects requiring external
> > coordination, drm_dep merges both roles into a single struct
> > drm_dep_queue. This eliminates the N:1 entity-to-scheduler mapping
> > that is unnecessary for firmware schedulers which manage their own
> > run-lists internally.
> >
> > Unlike drm_gpu_scheduler, which relies on external locking and lifetime
> > management by the driver, drm_dep uses reference counting (kref) on both
> > queues and jobs to guarantee object lifetime safety. A job holds a queue
> > reference from init until its last put, and the queue holds a job reference
> > from dispatch until the put_job worker runs. This makes use-after-free
> > impossible even when completion arrives from IRQ context or concurrent
> > teardown is in flight.
> >
> > The core objects are:
> >
> >   struct drm_dep_queue - a per-context submission queue owning an
> >     ordered submit workqueue, a TDR timeout workqueue, an SPSC job
> >     queue, and a pending-job list.
> >     Reference counted; drivers can embed
> >     it and provide a .release vfunc for RCU-safe teardown.
>
> First off, I like this idea, and actually think we should have done that
> from the start rather than trying to bend drm_sched to meet our

Yes. Tvrtko actually suggested this years ago, and in my naïveté I
rejected it. I’m eating my hat here.

> FW-assisted scheduling model. That's also the direction me and Danilo
> have been pushing for for the new JobQueue stuff in rust, so I'm glad
> to see some consensus here.
>
> Now, let's start with the usual naming nitpick :D => can't we find a
> better prefix than "drm_dep"? I think I get where "dep" comes from (the
> logic mostly takes care of job deps, and acts as a FIFO otherwise, no
> real scheduling involved). It's kinda okay for drm_dep_queue, even
> though, according to the description you've made, jobs seem to stay in
> that queue even after their deps are met, which, IMHO, is a bit
> confusing: dep_queue sounds like a queue in which jobs are placed until
> their deps are met, and then the job moves to some other queue.
>
> It gets worse for drm_dep_job, which sounds like a dep-only job, rather
> than a job that's queued to the drm_dep_queue. Same goes for
> drm_dep_fence, which I find super confusing. What this one does is just
> proxy the driver fence to provide proper isolation between GPU drivers
> and fence observers (other drivers).
>
> Since this new model is primarily designed for hardware that has
> FW-assisted scheduling, how about drm_fw_queue, drm_fw_job,
> drm_fw_job_fence?

We can bikeshed; I’m open to other names, but I believe hardware
scheduling can be built quite cleanly on top of this, so drm_fw_*
doesn’t really work either. Check out a hardware-scheduler PoC built
(today) on top of this in [1].

[1] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svn-perf-6-15-2025/-/commit/22c8aa993b5c9e4ad0c312af2f3e032273d20966

>
> >
> >   struct drm_dep_job - a single unit of GPU work. Drivers embed this
> >     and provide a .release vfunc. Jobs carry an xarray of input
> >     dma_fence dependencies and produce a drm_dep_fence as their
> >     finished fence.
> >
> >   struct drm_dep_fence - a dma_fence subclass wrapping an optional
> >     parent hardware fence. The finished fence is armed (sequence
> >     number assigned) before submission and signals when the hardware
> >     fence signals (or immediately on synchronous completion).
> >
> > Job lifecycle:
> >   1. drm_dep_job_init() - allocate and initialise; job acquires a
> >      queue reference.
> >   2. drm_dep_job_add_dependency() and friends - register input fences;
> >      duplicates from the same context are deduplicated.
> >   3. drm_dep_job_arm() - assign sequence number, obtain finished fence.
> >   4. drm_dep_job_push() - submit to queue.
> >
> > Submission paths under queue lock:
> >   - Bypass path: if DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set, the
> >     SPSC queue is empty, no dependencies are pending, and credits are
> >     available, the job is dispatched inline on the calling thread.
>
> I've yet to look at the code, but I must admit I'm less worried about
> this fast path if it's part of a new model restricted to FW-assisted
> scheduling. I keep thinking we're not entirely covered for so-called
> real-time GPU contexts that might have jobs that are not dep-free, and
> if we're going for something new, I'd really like us to consider that
> case from the start (maybe investigate if kthread_work[er] can be used
> as a replacement for workqueues, if RT priority on workqueues is not an
> option).
>

I mostly agree, and I’ll look into whether kthread_work is better
suited; if that’s the right model, it should be done up front. But can
you give a use case for real-time GPU contexts that are not dep-free? I
personally don’t know of one.

> >   - Queued path: job is pushed onto the SPSC queue and the run_job
> >     worker is kicked. The worker resolves remaining dependencies
> >     (installing wakeup callbacks for unresolved fences) before calling
> >     ops->run_job().
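
For readers following along, the choice between the two submission
paths quoted above can be sketched in plain userspace C. This is only
an illustration of the four bypass conditions as the cover letter
states them. The sketch_* structs, fields, and function are
hypothetical stand-ins, not the drm_dep types or API, and the caller is
assumed to hold the queue lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of queue state; not the drm_dep structs. */
struct sketch_queue {
	bool bypass_supported;     /* models DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED */
	size_t spsc_len;           /* jobs already waiting in the SPSC queue */
	unsigned int credits_free; /* credits not consumed by in-flight jobs */
};

/* Hypothetical, simplified model of a job. */
struct sketch_job {
	unsigned int credits;      /* credit cost declared at init time */
	size_t pending_deps;       /* input fences that have not signalled yet */
};

enum sketch_path { SKETCH_BYPASS, SKETCH_QUEUED };

/* Pick a submission path. The caller is assumed to hold the queue lock. */
static enum sketch_path sketch_pick_path(const struct sketch_queue *q,
					 const struct sketch_job *job)
{
	assert(q && job);

	/* All four conditions must hold for inline dispatch. */
	if (q->bypass_supported && q->spsc_len == 0 &&
	    job->pending_deps == 0 && job->credits <= q->credits_free)
		return SKETCH_BYPASS;

	/* Otherwise: push onto the SPSC queue and kick the run_job worker. */
	return SKETCH_QUEUED;
}
```

Requiring an empty SPSC queue keeps the bypass path from overtaking
jobs that were queued earlier, which preserves submission order.
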
> >
> > Credit-based throttling prevents hardware overflow: each job declares
> > a credit cost at init time; dispatch is deferred until sufficient
> > credits are available.
> >
> > Timeout Detection and Recovery (TDR): a per-queue delayed work item
> > fires when the head pending job exceeds q->job.timeout jiffies, calling
> > ops->timedout_job(). drm_dep_queue_trigger_timeout() forces immediate
> > expiry for device teardown.
> >
> > IRQ-safe completion: queues flagged DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE
> > allow drm_dep_job_done() to be called from hardirq context (e.g. a
> > dma_fence callback). Dependency cleanup is deferred to process context
> > after ops->run_job() returns to avoid calling xa_destroy() from IRQ.
> >
> > Zombie-state guard: workers use kref_get_unless_zero() on entry and
> > bail immediately if the queue refcount has already reached zero and
> > async teardown is in flight, preventing use-after-free.
> >
> > Teardown is always deferred to a module-private workqueue (dep_free_wq)
> > so that destroy_workqueue() is never called from within one of the
> > queue's own workers. Each queue holds a drm_dev_get() reference on its
> > owning struct drm_device, released as the final step of teardown via
> > drm_dev_put(). This prevents the driver module from being unloaded
> > while any queue is still alive without requiring a separate drain API.
>
> Thanks for posting this RFC. I'll try to have a closer look at the code
> in the coming days, but given the diffstat, it might take me a bit of
> time...

I understand; I’m a firehose when I get started. Hopefully a sane one,
though.

Matt

>
> Regards,
>
> Boris
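
As a footnote to the zombie-state guard described in the quoted cover
letter, here is a minimal userspace sketch of the "get unless zero"
pattern using C11 atomics. It only models the semantics of
kref_get_unless_zero(). The sketch_* names are hypothetical, and real
kernel code would use struct kref rather than a hand-rolled counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a kref-style counter. */
struct sketch_ref {
	atomic_uint count;
};

/* Take a reference only if the count has not already dropped to zero. */
static bool sketch_get_unless_zero(struct sketch_ref *r)
{
	unsigned int old = atomic_load(&r->count);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool sketch_put(struct sketch_ref *r)
{
	unsigned int old = atomic_fetch_sub(&r->count, 1);

	assert(old > 0);
	return old == 1;
}

/* Worker entry: bail out if the queue is already being torn down. */
static bool sketch_worker(struct sketch_ref *queue_ref)
{
	if (!sketch_get_unless_zero(queue_ref))
		return false; /* zombie queue: do not touch it */

	/* ... work that dereferences the queue would go here ... */

	/* A real worker would schedule teardown if this put were the last. */
	sketch_put(queue_ref);
	return true;
}
```

The compare-and-swap loop is what keeps a worker from reviving a
counter that has already reached zero, so a queue whose teardown is in
flight is never dereferenced.
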
