Hi all,

On Tue, 30 Sep 2025 11:00:00 +0200
Philipp Stanner <[email protected]> wrote:

> +Cc Sima, Dave
>
> On Mon, 2025-09-29 at 16:07 +0200, Danilo Krummrich wrote:
> > On Wed Sep 3, 2025 at 5:23 PM CEST, Tvrtko Ursulin wrote:
> > > This is another respin of this old work^1 which since v7 is a total
> > > rewrite and completely changes how the control is done.
> >
> > I only got some of the patches of the series, can you please send all of
> > them for subsequent submissions? You may also want to consider resending
> > if you're not getting a lot of feedback due to that. :)
> >
> > > On the userspace interface side of things it is the same as before. We
> > > have drm.weight as an interface, taking integers from 1 to 10000, the
> > > same as CPU and IO cgroup controllers.
> >
> > In general, I think it would be good to get GPU vendors to speak up to
> > what kind of interfaces they're heading to with firmware schedulers and
> > potential firmware APIs to control scheduling; especially given that this
> > will be a uAPI.
> >
> > (Adding a couple of folks to Cc.)
> >
> > Having that said, I think the basic drm.weight interface is fine and
> > should work in any case; i.e. with the existing DRM GPU scheduler in both
> > modes, the upcoming DRM Jobqueue efforts and should be generic enough to
> > work with potential firmware interfaces we may see in the future.
> >
> > Philipp should be talking about the DRM Jobqueue component at XDC
> > (probably just in this moment).
> >
> > --
> >
> > Some more thoughts on the DRM Jobqueue and scheduling:
> >
> > The idea behind the DRM Jobqueue is to be, as the name suggests, a
> > component that receives jobs from userspace, handles the dependencies
> > (i.e. dma fences), and executes the job, e.g. by writing to a firmware
> > managed software ring.
> >
> > It basically does what the GPU scheduler does in 1:1 entity-scheduler
> > mode, just without all the additional complexity of moving job ownership
> > from one component to another (i.e. from entity to scheduler, etc.).
> >
> > With just that, there is no scheduling outside the GPU's firmware
> > scheduler of course. However, additional scheduler capabilities, e.g. to
> > support hardware rings, or manage firmware schedulers that only support a
> > limited number of software rings (like some Mali GPUs), can be layered on
> > top of that:
> >
> > In contrast to the existing GPU scheduler, the idea would be to keep
> > letting the DRM Jobqueue handle jobs submitted by userspace from end to
> > end (i.e. let the push to the hardware (or software) ring buffer), but
> > have an additional component, whose only purpose is to orchestrate the
> > DRM Jobqueues, by managing when they are allowed to push to a ring and
> > which ring they should push to.
> >
> > This way we get rid of one of the issue that the existing GPU scheduler
> > moves job ownership between components of different lifetimes (entity and
> > scheduler), which is one of the fundamental hassles to deal with.
>
> So just a few minutes ago I had a long chat with Sima.
>
> Sima (and I, too, I think) thinks that the very few GPUs that have a
> reasonably low limit of firmware rings should just resource-limit
> userspace users once the limit of firmware rings is reached.
>
> Basically like with VRAM.
>
> Apparently Sima had suggested that to Panthor in the past? But Panthor
> still seems to have implemented yet another scheduler mechanism on top
> of the 1:1 entity-scheduler drm_sched setup?
>
> @Boris: Why was that done?

So, the primary reason was that the layer of scheduling we have doesn't
operate at the job or queue level, but at a higher level called a group,
which is basically a collection of queues that have close interactions (a
group is backing a VkQueue, and in Mali, a VkQueue has a vertex subqueue,
a fragment subqueue and a compute subqueue).
There's also some fairness involved in our scheduling, where we rotate the
priority of groups over time so it's not always the same group that gets
to execute its workload. I tried to build a mental model of Sima's
suggestion at the time, but I never managed to reconcile the job-level
scheduling (forcing a limit on the number of jobs that can be queued
per-subqueue) with the group-level scheduling here. It also didn't seem
like having this extra layer of scheduling was a big deal, because
ultimately it doesn't get in the way of the single-entity scheduling
provided by drm_sched, it's just something on top.

The other reason is that, even if we find a way to reconcile the two
scheduling models (job vs group) based on some resource-limiting
algorithm, it would get in the way of usermode queues, because then the
job delimitation is blurry. Indeed, in that case you no longer manipulate
jobs, but execution contexts that have to be scheduled in/out to
introduce some kind of fairness, at which point the resource becomes GPU
time, and you're back to the timeslice-based scheduling we have right now.

>
> So far I tend to prefer Sima's proposal because I'm currently very
> unsure how we could deal with shared firmware rings – because then we'd
> need to resubmit jobs, and the currently intended Rust ownership model
> would then be at danger, because the Jobqueue would need a:
> pending_list.

So, my take on that is that what we ultimately want is to have the
functionality provided by drm_sched split into different components that
can be used in isolation, or combined to provide advanced scheduling.

JobQueue:
- allows you to queue jobs with their deps
- dequeues jobs once their deps are met

Not too sure if we want a push or a pull model for the job dequeuing, but
the idea is that once the job is dequeued, ownership is passed to the SW
entity that dequeued it.
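
To make the JobQueue part a bit more concrete, here is a rough userspace
Rust sketch of the pull model. It's only meant to illustrate the ownership
hand-off, not to propose an API: Fence is a made-up stand-in for a
dma_fence, Job and dequeue_ready() are hypothetical names, and strict
in-order dequeuing is an assumption.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a dma_fence: a shared flag that gets signaled.
#[derive(Clone, Default)]
struct Fence(Arc<AtomicBool>);

impl Fence {
    fn signal(&self) {
        self.0.store(true, Ordering::Release);
    }

    fn is_signaled(&self) -> bool {
        self.0.load(Ordering::Acquire)
    }
}

/// A job plus the fences it depends on. `id` only exists for illustration.
struct Job {
    id: u32,
    deps: Vec<Fence>,
}

/// In-order queue (assumption): only the head job is a dequeue candidate.
#[derive(Default)]
struct JobQueue {
    jobs: VecDeque<Job>,
}

impl JobQueue {
    /// Queue a job together with its dependencies.
    fn queue(&mut self, job: Job) {
        self.jobs.push_back(job);
    }

    /// Pull model: hand out the head job by value once all of its deps are
    /// signaled. Ownership moves to the caller; the queue keeps nothing.
    fn dequeue_ready(&mut self) -> Option<Job> {
        let ready = self.jobs.front()?.deps.iter().all(Fence::is_signaled);
        if ready {
            self.jobs.pop_front()
        } else {
            None
        }
    }
}
```

Returning the Job by value is the point of the sketch: once
dequeue_ready() has handed a job out, the queue has nothing left to track
for it, and whatever dequeued it is the sole owner.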

Note that I intentionally didn't add the timeout handling here, because
dequeuing a job doesn't necessarily mean it's started immediately. If
you're dealing with HW queues, you might have to wait for a slot to become
available. If you're dealing with something like Mali-CSF, where the
number of FW slots is limited, you want to wait for your execution context
to be passed to the FW for scheduling. The final situation is
full-fledged FW scheduling, where you want things to start as soon as you
have space in your FW queue (AKA ring-buffer?).

JobHWDispatcher: (not sure about the name, I'm bad at naming things)

This object basically pulls ready jobs from one or multiple JobQueues into
its own queue, and waits for a HW slot to become available. If you go for
the push model, the job gets pushed to the HW dispatcher queue and waits
there until a HW slot becomes available. That's where timeouts should be
handled, because the job only becomes active when it gets pushed to a HW
slot. I guess if we want a resubmit mechanism, it would have to take place
here, but given how tricky this has been, I'd be tempted to leave that to
drivers, that is, let them requeue the non-faulty jobs directly to their
JobHWDispatcher implementation after a reset.

FWExecutionContextScheduler: (again, pick a different name if you want)

This scheduler doesn't know about jobs, meaning there's a driver-specific
entity that needs to dequeue jobs from the JobQueue and push those to the
relevant ringbuffer. Once a FWExecutionContext has something to execute,
it becomes a candidate for FWExecutionContextScheduler, which gets to
decide which set of FWExecutionContexts gets a chance to be scheduled by
the FW. That one is for the Mali-CSF case I described above, and I'm not
too sure we want it to be generic, at least not until we have another GPU
driver needing the same kind of scheduling.
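
A minimal sketch of the slot selection such a FWExecutionContextScheduler
would do, assuming a plain round-robin rotation (a simplification, just
one possible policy). This is userspace Rust with hypothetical names
throughout: FwCtxScheduler, CtxId, mark_runnable(), mark_idle() and tick()
are made up for illustration, and num_slots stands in for whatever limit
the FW exposes.

```rust
use std::collections::VecDeque;

/// Hypothetical identifier for a FW execution context.
type CtxId = u32;

/// Decides which runnable contexts occupy the limited FW slots.
/// It never sees jobs, only contexts that have something to execute.
struct FwCtxScheduler {
    num_slots: usize,
    /// Runnable contexts in rotation order (front gets a slot first).
    runnable: VecDeque<CtxId>,
    /// Contexts currently handed to the FW.
    resident: Vec<CtxId>,
}

impl FwCtxScheduler {
    fn new(num_slots: usize) -> Self {
        Self { num_slots, runnable: VecDeque::new(), resident: Vec::new() }
    }

    /// Called by the driver after it pushed work to the context's ring.
    fn mark_runnable(&mut self, ctx: CtxId) {
        if !self.runnable.contains(&ctx) {
            self.runnable.push_back(ctx);
        }
    }

    /// Called when the context has nothing left to execute.
    fn mark_idle(&mut self, ctx: CtxId) {
        self.runnable.retain(|&c| c != ctx);
        self.resident.retain(|&c| c != ctx);
    }

    /// Timeslice tick. Returns (evicted, scheduled_in); these are the
    /// points where a driver could pause or resume per-context timers.
    fn tick(&mut self) -> (Vec<CtxId>, Vec<CtxId>) {
        // Contexts that just had a timeslice go to the back of the rotation.
        for ctx in &self.resident {
            if let Some(pos) = self.runnable.iter().position(|c| c == ctx) {
                if let Some(c) = self.runnable.remove(pos) {
                    self.runnable.push_back(c);
                }
            }
        }
        // The first `num_slots` runnable contexts get the FW slots.
        let next: Vec<CtxId> =
            self.runnable.iter().copied().take(self.num_slots).collect();
        let evicted: Vec<CtxId> =
            self.resident.iter().copied().filter(|c| !next.contains(c)).collect();
        let scheduled: Vec<CtxId> =
            next.iter().copied().filter(|c| !self.resident.contains(c)).collect();
        self.resident = next;
        (evicted, scheduled)
    }
}
```

With two slots and three runnable contexts, each tick evicts one context
and schedules another one in, so every context gets two timeslices out of
three.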

Again, you want to defer the timeout handling to this component, because
the timer should only start/resume when the FWExecutionContext gets
scheduled, and it should be paused as soon as the context gets evicted.

TL;DR: I think the main problem we had with drm_sched is that it had this
clear drm_sched_entity/drm_gpu_scheduler separation, but those two
components were tightly tied together, with no way to use drm_sched_entity
alone for instance, and this led to the weird lifetime/ownership issues
that the Rust effort made more apparent. If we get to design something
new, I think we should try hard to get a clear isolation between each of
these components so they can be used alone or combined, with a clear job
ownership model.

Regards,

Boris
